Abstract

This paper explores the problem of efficiently permuting data stored in VLSI chips in
accordance with a predetermined set of permutations. By connecting chips with shared bus
interconnections, as opposed to point-to-point interconnections, we show that the number
of pins per chip can often be reduced. For example, for infinitely many n, we exhibit
permutation architectures with $\lceil \sqrt{n} \rceil$ pins per chip that can realize any of the n cyclic shifts
on n chips in one clock tick. When the set of permutations forms a group with p elements,
any permutation in the group can be realized in one clock tick by an architecture with
$O(\sqrt{p \lg p})$ pins per chip. When the permutation group is abelian, $O(\sqrt{p})$ pins suffice.
These results are all derived from a mathematical characterization of uniform permutation
architectures based on the combinatorial notion of a difference cover. We also consider
uniform permutation architectures that realize permutations in several clock ticks, instead
of one, and show that further savings in the number of pins per chip can be obtained.
1  Introduction


The  organization  of  communication  among  chips  is  a  major  concern  in  the  design  of  an 
electronic  system.  Because  of  the  costs  associated  with  wiring  and  packaging,  it  is  generally 
desirable to minimize the number of wires and the number of pins per chip in an architecture.
This paper investigates how busses (multiple-pin wires) can be employed to efficiently
implement  various  communication  patterns  among  a  set  of  chips.  Other  theoretical  studies 
of  bussed  interconnections  can  be  found  in  [1,  3,  4,  5,  7,  12,  21,  24,  25,  29]. 

Perhaps  the  simplest  example  of  the  advantage  of  bussed  interconnections  is  the  use  of 
a  single  shared  bus  to  communicate  between  any  pair  of  chips  connected  to  the  bus  in  one 
clock  tick.  Communicating  between  any  pair  of  chips  in  one  clock  tick  can  be  implemented 
with two-pin wires, but any such scheme requires $\binom{n}{2}$ wires and $n - 1$ pins per chip.¹ Of
course, a two-pin interconnection scheme may be able to implement more communication
patterns,  but  if  we  are  only  interested  in  communication  between  individual  pairs,  the 
additional  power,  which  comes  at  a  high  cost,  is  wasted. 

An  example  that  better  illustrates  the  ideas  in  this  paper  comes  from  the  problem  of 
building  a  fast  cyclic  shifter  (sometimes  called  a  barrel  shifter)  on  n  chips.  Initially,  each 
chip c contains a one-bit value $x_c$. The function of the shifter is to move each bit $x_c$ to chip
$c + s \pmod{n}$ in one clock tick, where s can be any value between 0 and $n - 1$.

Any cyclic shifter that uses only two-pin wires requires at least $\binom{n}{2}$ wires and $n - 1$ pins
per  chip  in  order  to  shift  in  one  clock  tick  because  each  chip  must  be  able  to  communicate 
directly  with  each  of  the  other  n  —  1  chips.  Using  busses,  however,  we  can  do  much  better. 
Figure  1  gives  an  architecture  for  a  cyclic  shifter  on  13  chips  which  uses  13  busses  and  only 
4  pins  per  chip.  To  realize  a  shift  by  8,  for  example,  each  chip  writes  its  bit  to  pin  3  and 
reads  from  pin  1.  The  reader  may  verify  that  all  other  cyclic  shifts  among  the  chips  are 
possible  in  one  clock  tick.  (In  Section  4,  we  give  a  general  method  for  constructing  such 
cyclic shifters based on finite projective planes.)


Figure 1. A cyclic shifter on 13 chips that uses 13 busses. Each chip has 4 pins, and each bus
has 4 chips connected to it. This cyclic shifter is based on the difference cover {0, 1, 3, 9} for $Z_{13}$.

The cyclic shifter of Figure 1 has the advantage of uniformity. All chips have exactly
the same number of pins, and to accomplish each of the 13 permutations specified by the
problem, all chips write to (and read from) pins with identical labels. For all busses, the
number of pins per bus is 4, which is the same as the number of pins per chip. Moreover,
the connections between chips and busses follow a periodic pattern. The uniformity of the
architecture leads to simplicity in the control of the system. Four control wires from a
central controller are sufficient to determine each of the 13 shifts (two wires for specifying
the number of the pin on which to write, and two for the pin to read), which is the
minimum possible. Thus, our control scheme uses the minimum number of control pins,
and the on-chip decoding logic is straightforward and identical for all the chips.

¹Unless otherwise specified, we count only data pins in our analysis and omit consideration of the pins
for control, clock, power, and ground since they are needed by all implementations.

Cyclic shifters for general n can be constructed using an idea from combinatorial mathematics
related to difference sets [18, p. 121]. (See also [6, 14, 16, 22, 26].)

Definition 1 A subset $D \subseteq Z_n$ of the integers modulo n is a difference cover for $Z_n$ if for
all $s \in Z_n$, there exist $d_i, d_j \in D$ such that $s = d_i - d_j \pmod{n}$.

That is, every integer in $Z_n$ can be represented as the difference modulo n of two integers
in D. For example, the set D = {0, 1, 3, 9} is a difference cover for $Z_{13}$, since

     0 = 0 - 0
     1 = 1 - 0
     2 = 3 - 1
     3 = 3 - 0
     4 = 0 - 9
     5 = 1 - 9
     6 = 9 - 3
     7 = 3 - 9
     8 = 9 - 1
     9 = 9 - 0
    10 = 0 - 3
    11 = 1 - 3
    12 = 0 - 1,

where  all  subtractions  are  performed  modulo  13. 
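
The table above can be checked mechanically. The following is a minimal sketch, not part of the original construction; the function names `is_difference_cover` and `witnesses` are illustrative choices. It tests Definition 1 directly by enumerating all ordered pairs of elements of D.

```python
from itertools import product


def is_difference_cover(D, n):
    """Return True if every residue mod n equals d_i - d_j for some d_i, d_j in D."""
    differences = {(di - dj) % n for di, dj in product(D, repeat=2)}
    return differences == set(range(n))


def witnesses(D, n):
    """Map each representable residue s to one pair (d_i, d_j) with s = d_i - d_j (mod n)."""
    table = {}
    for di, dj in product(D, repeat=2):
        # Keep the first pair found for each residue.
        table.setdefault((di - dj) % n, (di, dj))
    return table


if __name__ == "__main__":
    print(is_difference_cover([0, 1, 3, 9], 13))
    print(witnesses([0, 1, 3, 9], 13)[8])
```

For D = {0, 1, 3, 9} and n = 13 the twelve nonzero differences are all distinct, so each nonzero residue has exactly one witness pair; for example, 8 is represented only as 9 - 1.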

Given a difference cover for $Z_n$ with k elements, a cyclic shifter on n chips with n busses
and k pins per chip can be constructed. Suppose $D = \{d_0, d_1, \ldots, d_{k-1}\}$ is a difference
cover for $Z_n$. In the cyclic shifter, chip c connects via its pin i to bus $c + d_i \pmod{n}$, for
all $c = 0, 1, \ldots, n-1$ and $i = 0, 1, \ldots, k-1$. To see that any cyclic shift on the n chips
can be uniformly realized, consider a cyclic shift by s. Since D is a difference cover for $Z_n$,
there exist $d_i, d_j \in D$ such that $s = d_i - d_j \pmod{n}$. To realize the shift by s, each chip
writes to pin i and reads from pin j. Chip c therefore writes onto bus $c + d_i$, and bus $c + d_i$
is read by chip $(c + d_i) - d_j = c + s$. No collisions occur because each bus has exactly one
pin labeled i and one pin labeled j connected to it, as can be verified.
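
The construction just described can be simulated in a few lines. This is a sketch under stated assumptions: the names `build_shifter` and `realize_shift` are hypothetical, pins are numbered by position in the list D, and one clock tick is modeled as "every chip writes, then every chip reads".

```python
def build_shifter(D, n):
    """bus[c][i] is the bus that chip c reaches through its pin i, namely c + d_i (mod n)."""
    return [[(c + d) % n for d in D] for c in range(n)]


def realize_shift(D, n, s):
    """Find pins (i, j) with s = d_i - d_j (mod n) and simulate one clock tick.

    Returns (i, j, received), where received[c] is the chip whose datum chip c reads,
    or None if D does not represent the shift s.
    """
    bus = build_shifter(D, n)
    for i, di in enumerate(D):
        for j, dj in enumerate(D):
            if (di - dj) % n != s % n:
                continue
            on_bus = {}
            for c in range(n):
                b = bus[c][i]
                if b in on_bus:
                    raise ValueError("collision: two chips write to the same bus")
                on_bus[b] = c  # chip c drives bus b through pin i
            received = [on_bus[bus[c][j]] for c in range(n)]  # chip c listens on pin j
            return i, j, received
    return None
```

With D = [0, 1, 3, 9] and n = 13, a shift by 8 is found at pins (3, 1), matching the example given for Figure 1: chip c receives the bit that started on chip c - 8 (mod 13).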

The  remainder  of  this  paper  explores  permutation  architectures,  the  properties  of 
multiple-pin interconnections, and related combinatorial mathematics. In Section 2 we
define a permutation architecture, introduce the notion of uniformity, and prove some basic
properties of architectures that employ busses to realize arbitrary sets of permutations.
Section 3 defines the notion of a difference cover for a set of permutations, relates it to
the notion of a uniform permutation architecture, and proves some properties of difference
covers. In Section 4 we show how to build cyclic shifters that are provably efficient. Section 5
investigates how to design small difference covers for any set of permutations that
forms  a  finite  group.  In  Section  6  we  extend  the  discussion  to  uniform  architectures  that 
realize  permutations  in  more  than  one  clock  tick.  We  present  a  variety  of  extensions  to  the 
results  of  the  paper  in  Section  7.  Finally,  in  Section  8  we  discuss  questions  left  open  by 
our  research.  An  appendix  of  standard  notations  and  definitions  is  included  for  reference. 
Notations  and  definitions  more  specific  to  the  content  of  the  paper  are  provided  in  context. 


2  Permutation  architectures 

In  this  section  we  formally  define  the  notion  of  a  permutation  architecture,  and  we  make 
precise  the  notion  of  uniformity.  We  also  prove  some  basic  properties  of  permutation 
architectures  that  realize  arbitrary  sets  of  permutations.  The  definitions  in  this  section  are 
somewhat  intricate  and  tedious,  and  are  indicative  of  the  difficulties  faced  in  the  design  of 
efficient  permutation  architectures.  In  the  next  section,  however,  we  use  these  definitions 
to  show  that  reasoning  about  uniform  permutation  architectures  is  essentially  equivalent 
to  reasoning  about  difference  covers,  a  simpler  and  more  elegant  mathematical  notion.  The 
remainder  of  the  paper  then  uses  the  simpler  notion. 

For convenience, we adopt a few notational conventions. We use multiplicative notation
to denote composition of permutations. The inverse of a permutation $\pi$ is denoted by
$\pi^{-1}$. Composition of functions is performed in right-to-left order, so that $\pi_1\pi_2$ is defined
by $\pi_1\pi_2(i) = \pi_1(\pi_2(i))$. The identity permutation on n elements is denoted by $I_n$, or
by $I$ if the number of elements is unimportant. For a permutation set $\Phi$, we denote
by $\Phi^{-1}$ the set of all the inverses of the permutations of $\Phi$, i.e., $\Phi^{-1} = \{\phi^{-1} : \phi \in \Phi\}$.

For two permutation sets $\Phi$ and $\Psi$, the notation $\Phi\Psi$ is used to denote the permutation
set $\{\phi\psi : \phi \in \Phi \text{ and } \psi \in \Psi\}$. We use the notation [n] to denote the set of n integers
$\{0, 1, \ldots, n-1\}$.
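
These conventions are easy to get backwards, so a small sketch may help. It assumes a representation that is not in the paper: a permutation on [n] is a tuple p with p[i] the image of i, and the helper names (`compose`, `inverse`, `identity`, `inverse_set`, `product_set`) are illustrative.

```python
def compose(p1, p2):
    """Right-to-left composition: (p1 p2)(i) = p1(p2(i))."""
    return tuple(p1[p2[i]] for i in range(len(p2)))


def inverse(p):
    """Inverse permutation: inverse(p)[p[i]] = i."""
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)


def identity(n):
    """The identity permutation on [n]."""
    return tuple(range(n))


def inverse_set(Phi):
    """The set of inverses of the permutations in Phi."""
    return {inverse(p) for p in Phi}


def product_set(Phi, Psi):
    """The set of all products phi psi with phi in Phi and psi in Psi."""
    return {compose(p, q) for p in Phi for q in Psi}
```

Because composition is right-to-left, `compose(p1, p2)` applies `p2` first; swapping the arguments generally gives a different permutation.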

We  first  define  the  notion  of  a  permutation  architecture. 

Definition 2 A permutation architecture is a 6-tuple A = (C, B, P, CHIP, BUS, LABEL) as
follows.

1. C is a set of chips;

2. B is a set of busses;

3. P is a set of pins;

4. CHIP is a function $\mathrm{CHIP} : P \to C$;

5. BUS is a function $\mathrm{BUS} : P \to B$;

6. LABEL is a function $\mathrm{LABEL} : P \to N$, where if $x, y \in P$, $x \ne y$, and $\mathrm{CHIP}(x) = \mathrm{CHIP}(y)$,
then $\mathrm{LABEL}(x) \ne \mathrm{LABEL}(y)$.

The  set  C  contains  all  the  chips  in  the  architecture,  and  the  set  B  contains  all  the  busses. 
Which chips are connected to which busses is determined by the pins they have in common;
the  set  P  contains  all  the  pins.  The  function  CHIP  determines  which  pins  belong  to  which 
chips.  Similarly,  the  function  BUS  determines  which  pins  are  interconnected  by  which  bus. 
The  function  LABEL  names  the  pins  on  the  chips  by  natural  numbers  such  that  all  pins  on 
a  given  chip  have  distinct  labels,  which  we  shall  sometimes  call  pin  numbers. 

Our formal definition of a permutation architecture omits several subsystems that technically
should be included, but whose inclusion is not germane to our study. These subsystems
include a control network that specifies what permutation is to be performed and
clocking circuitry for synchronization. Our focus is on the structure of the bussed
interconnections for permuting the data, and thus our definition encompasses only this aspect
of the architecture.

We  now  define  what  it  means  for  a  permutation  architecture  to  realize  a  permutation. 

Definition 3 A permutation architecture A = (C, B, P, CHIP, BUS, LABEL) realizes a permutation
$\pi : C \to C$ if there exist two functions $\mathrm{WRITE}_\pi : C \to P$ and $\mathrm{READ}_\pi : C \to P$,
such that for any chips $c, c_1, c_2 \in C$, we have:

1. $\mathrm{CHIP}(\mathrm{READ}_\pi(c)) = \mathrm{CHIP}(\mathrm{WRITE}_\pi(c)) = c$;

2. $\mathrm{BUS}(\mathrm{WRITE}_\pi(c)) = \mathrm{BUS}(\mathrm{READ}_\pi(\pi(c)))$;

3. $c_1 \ne c_2$ implies $\mathrm{BUS}(\mathrm{WRITE}_\pi(c_1)) \ne \mathrm{BUS}(\mathrm{WRITE}_\pi(c_2))$.

The architecture uniformly realizes $\pi$ if, in addition:

4. $\mathrm{LABEL}(\mathrm{WRITE}_\pi(c_1)) = \mathrm{LABEL}(\mathrm{WRITE}_\pi(c_2))$;

5. $\mathrm{LABEL}(\mathrm{READ}_\pi(c_1)) = \mathrm{LABEL}(\mathrm{READ}_\pi(c_2))$.

We say a permutation architecture realizes a set $\Pi$ of permutations if it realizes every
permutation in $\Pi$. We say it uniformly realizes $\Pi$ if it uniformly realizes every permutation
in $\Pi$.

Intuitively, for a permutation $\pi$, the functions $\mathrm{WRITE}_\pi$ and $\mathrm{READ}_\pi$ identify the write
pin and the read pin for each chip. Condition 1 makes sure that each chip writes and reads
pins that are connected to it. Condition 2 ensures that the bus to which chip c writes is
read by chip $\pi(c)$. Condition 3 guarantees that no collisions occur, that is, no two data
transfers use the same bus. The architecture uniformly realizes a permutation (Conditions
4 and 5) if all chips write to pins with the same pin number and read from pins with the
same pin number, as in the cyclic shifter from Figure 1.
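
As an illustration only, the five conditions of Definition 3 can be written as executable checks. The encoding is an assumption of this sketch: CHIP, BUS, and LABEL become dictionaries keyed by pin, and the write and read functions become dictionaries keyed by chip. The function names `realizes` and `uniformly_realizes` are hypothetical.

```python
def realizes(chips, chip_of, bus_of, pi, write, read):
    """Check Conditions 1-3 of Definition 3 for the permutation pi."""
    cond1 = all(chip_of[write[c]] == c and chip_of[read[c]] == c for c in chips)
    cond2 = all(bus_of[write[c]] == bus_of[read[pi[c]]] for c in chips)
    written = [bus_of[write[c]] for c in chips]
    cond3 = len(set(written)) == len(written)  # no two chips write to one bus
    return cond1 and cond2 and cond3


def uniformly_realizes(chips, chip_of, bus_of, label_of, pi, write, read):
    """Check Conditions 1-5: all chips share one write label and one read label."""
    cond4 = len({label_of[write[c]] for c in chips}) == 1
    cond5 = len({label_of[read[c]] for c in chips}) == 1
    return realizes(chips, chip_of, bus_of, pi, write, read) and cond4 and cond5


# The 13-chip shifter of Figure 1: pin (c, i) joins chip c to bus c + D[i] (mod 13).
N, D = 13, [0, 1, 3, 9]
CHIPS = list(range(N))
PINS = [(c, i) for c in CHIPS for i in range(len(D))]
CHIP_OF = {p: p[0] for p in PINS}
BUS_OF = {p: (p[0] + D[p[1]]) % N for p in PINS}
LABEL_OF = {p: p[1] for p in PINS}

SHIFT8 = {c: (c + 8) % N for c in CHIPS}
WRITE = {c: (c, 3) for c in CHIPS}  # every chip writes to pin 3
READ = {c: (c, 1) for c in CHIPS}   # every chip reads from pin 1
```

Calling `uniformly_realizes(CHIPS, CHIP_OF, BUS_OF, LABEL_OF, SHIFT8, WRITE, READ)` confirms the shift-by-8 example from the introduction; reading from a different pin breaks Condition 2.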
Our definition of a permutation architecture implies that “complete” permutations are
to  be  realized,  that  is,  every  chip  sends  exactly  one  datum  and  receives  exactly  one  datum. 
Moreover,  an  interconnection  is  required  even  when  a  chip  sends  a  datum  to  itself.  Since 
no  collisions  occur,  the  number  of  busses  in  the  architecture  must  be  at  least  the  number 
of  chips.  This  observation  leads  directly  to  the  following  theorem. 

Theorem 1 In any permutation architecture that realizes some nonempty permutation set
$\Pi$, the average number of pins per bus is at most the average number of pins per chip.

Proof. Let A = (C, B, P, CHIP, BUS, LABEL) be a permutation architecture for $\Pi$. The
average number of pins per chip is $|P| / |C|$, and the average number of pins per bus is
$|P| / |B|$. Condition 3 of Definition 3 says that for any permutation $\pi \in \Pi$, any two distinct
chips are mapped to distinct busses. Consequently, we get that $|B| \ge |C|$, which proves
the theorem. ∎

Under the assumption that no interconnection is needed for a chip to send data to
itself, Theorem 1 is no longer applicable. A similar theorem can be proved for this model,
however, which involves the number of fixed points in the permutations realized by the
architecture. Specifically, suppose the architecture realizes a set $\Pi$ of permutations. Define
the rank of a permutation $\pi \in \Pi$ as $\mathrm{RANK}(\pi) = |\{c \in C : \pi(c) \ne c\}|$, and define the rank
of the permutation set $\Pi$ as $\mathrm{RANK}(\Pi) = \max_{\pi \in \Pi} \mathrm{RANK}(\pi)$. The analogue to Theorem 1
states that the ratio between the average number of pins per bus and the average number
of pins per chip is at most $|C| / \mathrm{RANK}(\Pi)$.

In any architecture A that uniformly realizes a permutation set $\Pi$, the number of pins
that are actually used to uniformly realize $\Pi$ is the same for all chips, and additional pins
on a chip are unused. Furthermore, the number of busses used in realizing any permutation
$\pi \in \Pi$ is equal to the number of chips. These observations lead to the following definition
of a uniform architecture.

Definition 4 A uniform permutation architecture for a permutation set $\Pi$ is a permutation
architecture A = (C, B, P, CHIP, BUS, LABEL) such that:

1. A uniformly realizes $\Pi$;

2. $|\{x \in P : \mathrm{CHIP}(x) = c_1\}| = |\{x \in P : \mathrm{CHIP}(x) = c_2\}|$ for any two chips $c_1, c_2 \in C$;

3. $|B| = |C|$;

4. if $x \ne y$ and $\mathrm{LABEL}(x) = \mathrm{LABEL}(y)$, then $\mathrm{BUS}(x) \ne \mathrm{BUS}(y)$.

Thus,  all  the  chips  in  a  uniform  permutation  architecture  have  the  same  number  of  pins 
(Condition  2),  the  number  of  busses  is  equal  to  the  number  of  chips  (Condition  3),  and 
the  labels  of  the  pins  on  any  bus  are  distinct  (Condition  4). 

The following theorem demonstrates that any permutation architecture that uniformly
realizes some permutation set $\Pi$ can be made into a uniform architecture.

Theorem 2 Let A = (C, B, P, CHIP, BUS, LABEL) be a permutation architecture that uniformly
realizes the permutation set $\Pi$, and let k be the smallest number of pins on any
chip in C. Then there is a uniform architecture A' = (C', B', P', CHIP', BUS', LABEL') for
$\Pi$ with at most k pins per chip.

Proof. We construct the uniform architecture A' from the permutation architecture
A in two steps. First, we construct an intermediate permutation architecture
A'' = (C'', B'', P'', CHIP'', BUS'', LABEL'') by removing extraneous pins from chips in A
such that all chips end up with the same number of pins per chip and such that each pin
plays a role in uniformly realizing $\Pi$. Then, the busses of A'' are reorganized to produce
the architecture A' in such a way that the number of busses in A' is equal to the number
of chips. We assume that the permutation set $\Pi$ is nonempty, since otherwise the theorem
is trivial.

In the first step, we remove pins that are unused in uniformly realizing $\Pi$. Since A
uniformly realizes $\Pi$, each permutation $\pi \in \Pi$ can be associated with a distinct pair (i, j)
of pin labels corresponding to the labels that all chips write to and read from in order to
realize $\pi$. A pin is unused if its label does not appear in any of these $|\Pi|$ pairs. Removing
the unused pins results in the architecture A'' in which all chips have the same number
of pins, since each chip has exactly one pin for each label used in uniformly realizing $\Pi$.
The permutation architecture A'' uniformly realizes $\Pi$, and furthermore, each pin is used
in uniformly realizing some $\pi \in \Pi$. If we let s denote the number of pins per chip in A'',
then we have $s \le k$, since originally at least one chip had k pins and no pins were added.

In the second step, we reorganize the busses of A'' to produce the uniform architecture
A' in which the number of busses is equal to the number of chips. For any permutation
architecture that realizes a nonempty permutation set, the number of busses is never
smaller than the number of chips. Assume without loss of generality that C'' = [n],
B'' = [m], and range(LABEL'') = [s]. The theorem is proved if the architecture A'' uses
only n = |C''| busses, but in general, the architecture might use m > n busses.

We define a collection of mappings $\Psi = \{\psi_0, \ldots, \psi_{s-1}\}$, where for each $0 \le i \le s-1$,
the mapping $\psi_i : [n] \to [m]$ is defined to be $\psi_i(c) = b$ if and only if chip $c \in C''$ is connected
via its pin number i to bus $b \in B''$. The elements of $\Psi$ are indeed mappings since each
chip has a pin numbered i for each $0 \le i \le s-1$. The mappings are injective (one-to-one),
since otherwise two pins with the same pin number would be connected to the same bus,
and both pins could not be used to uniformly realize permutations, thereby violating the
construction of A'' in the first step. The collection $\Psi$ is a multiset, since it may be that
two different pin numbers $i \ne j$ define the same mapping (i.e., $\psi_i = \psi_j$). The key idea is
that any permutation is implemented by each chip writing to pin i and reading from pin j,
thereby employing the mapping $\psi_i$ to write data from the n chips to n distinct busses, and
the inverse of the mapping $\psi_j$ to read data from the same n busses back to the n chips.

We now show how to reorganize the busses of A'' in order to construct a uniform
architecture A'. We partition $\Psi$ into l equivalence classes $\Psi_0 \cup \Psi_1 \cup \cdots \cup \Psi_{l-1}$ such that
$\psi_i$ and $\psi_j$ are in the same equivalence class if and only if $\mathrm{range}(\psi_i) = \mathrm{range}(\psi_j)$. This
partitioning has the property that if $\pi \in \Pi$, then there exists an r such that $\pi = \psi_j^{-1}\psi_i$,
where $\psi_i, \psi_j \in \Psi_r$. (Recall that the inverse of an injective mapping $\psi : [n] \to [m]$ is defined
as the mapping $\psi^{-1} : \mathrm{range}(\psi) \to [n]$ such that if $\psi(c) = b$, then $\psi^{-1}(b) = c$.) For each
$0 \le r \le l-1$, pick a bijection (one-to-one, onto) $f_r : \mathrm{range}(\psi) \to [n]$, where $\psi$ is any mapping in
$\Psi_r$. (We can pick a bijection, since $\psi$ is injective, which implies $|\mathrm{range}(\psi)| = n$.) We define
the architecture A' by C' = C'', B' = [n], P' = P'', CHIP' = CHIP'', LABEL' = LABEL'', and
for any pin $x \in P'$ such that $\psi_{\mathrm{LABEL}'(x)} \in \Psi_r$, we define $\mathrm{BUS}'(x) = f_r(\mathrm{BUS}''(x))$.

The architecture A' has exactly s pins per chip and satisfies $|B'| = |C'| = n$, thereby
satisfying Conditions 2 and 3 of Definition 4. We show Condition 4 holds by considering
any two pins x and y with $\mathrm{LABEL}'(x) = \mathrm{LABEL}'(y) = i$. We have $\mathrm{BUS}'(x) = f_r(\mathrm{BUS}''(x))$
and $\mathrm{BUS}'(y) = f_r(\mathrm{BUS}''(y))$ for some $f_r$ as defined in the previous paragraph. Since $f_r$ is
an injective mapping and because Condition 4 of Definition 4 holds for A'', we then have
$x \ne y$ implies $\mathrm{BUS}'(x) \ne \mathrm{BUS}'(y)$.

It remains to show that Condition 1 of Definition 4 holds, that is, that A' uniformly
realizes $\Pi$. Consider any permutation $\pi \in \Pi$. Since A'' uniformly realizes $\Pi$, there exists a
pair of pin labels (i, j) such that $\pi$ is realized in A'' by each chip writing to its pin numbered
i and reading from its pin numbered j. We use the same pin labels (i, j) to realize the
permutation $\pi$ in A'. Conditions 1, 4, and 5 of Definition 3 are immediately satisfied. To
verify Conditions 2 and 3 we use the following observation. In architecture A'' chip c is
connected via its pin labeled h to bus $\psi_h(c)$, while in architecture A' it is connected to
bus $f_r\psi_h(c)$, where $\psi_h \in \Psi_r$. Condition 2 now holds since $\pi = \psi_j^{-1}\psi_i = (f_r\psi_j)^{-1}(f_r\psi_i)$.
Condition 3 holds since $f_r\psi_i$ is a permutation on [n]. We therefore conclude that A' is a
uniform architecture for $\Pi$ with at most k pins per chip. ∎

The next theorem provides a lower bound on the number of pins per chip in any uniform
architecture for a permutation set $\Pi$. (A related theorem due to C. Fiduccia appears in
[20, p. 308].)

Theorem 3 Let A = (C, B, P, CHIP, BUS, LABEL) be a uniform permutation architecture
for a permutation set $\Pi$. Then the number of pins per chip in A is at least $\sqrt{|\Pi|}$.

Proof. Because architecture A realizes $\Pi$ uniformly, we can associate each $\pi \in \Pi$ with a
pair (i, j) of pin numbers such that $\pi$ is realized by each chip writing to its pin labeled
i and reading from its pin labeled j. Since A is uniform, each chip has exactly $|P| / |C|$
pins, and the number of such pairs is $(|P| / |C|)^2$. No two permutations can be associated
with the same pair, and thus, we have $(|P| / |C|)^2 \ge |\Pi|$, or $|P| / |C| \ge \sqrt{|\Pi|}$. ∎
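
Since the number of pins per chip is an integer, the bound of Theorem 3 amounts to the smallest k with $k^2 \ge |\Pi|$. The helper below is a small sketch of that arithmetic; the name `min_pins_lower_bound` is illustrative, and the function only evaluates the bound, it does not construct an architecture.

```python
import math


def min_pins_lower_bound(num_permutations):
    """Smallest integer k with k * k >= num_permutations (Theorem 3, rounded up)."""
    root = math.isqrt(num_permutations)
    return root if root * root == num_permutations else root + 1
```

For the 13 cyclic shifts the bound evaluates to 4, which is the pin count of the shifter in Figure 1, so that design meets the bound.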

A  permutation  architecture  can  often  nonuniformly  realize  many  more  permutations 
than  the  square  of  the  number  of  pins  per  chip.  As  an  example,  consider  a  “crossbar” 
architecture  of  n  chips  and  n  busses  where  each  chip  is  connected  to  each  bus.  This 
architecture  can  nonuniformly  realize  all  n!  permutations,  which  is  much  greater  than 
the  square  of  the  number  of  pins  per  chip.  In  Section  7  we  discuss  some  of  the  capabilities 
of  nonuniform  permutation  architectures. 
3  Difference  covers


In this section, we present our main theorems, which establish the relationship between
difference covers for permutation sets and uniform permutation architectures. We also
prove some lemmas concerning difference covers for Cartesian products of permutation
sets. Finally, we present an alternative representation for difference covers called substring
covers based on similar notions in the literature of difference sets.

We first provide a generalization of Definition 1 to arbitrary sets of permutations.

Definition 5 A difference cover for a permutation set $\Pi$ is a set $\Phi = \{\phi_0, \phi_1, \ldots, \phi_{k-1}\}$
of permutations such that for each $\pi \in \Pi$ there exist $\phi_i, \phi_j \in \Phi$ such that $\pi = \phi_j^{-1}\phi_i$.

Equivalently, we can use our product-of-sets notation to say that $\Phi$ is a difference cover
for $\Pi$ if $\Phi^{-1}\Phi \supseteq \Pi$.

The following two theorems show how difference covers and uniform architectures are
related. Theorem 4 describes how to design a uniform architecture for a permutation set
$\Pi$ when a difference cover for $\Pi$ is given. Theorem 5 presents a construction of a difference
cover for a permutation set $\Pi$ from a uniform architecture for $\Pi$.

Theorem 4 Let $\Pi$ be a permutation set, and let $\Phi$ be a difference cover for $\Pi$ such that
$|\Phi| = k$. Then there exists a uniform architecture for $\Pi$ with k pins per chip.

Proof. Let $\Phi = \{\phi_0, \phi_1, \ldots, \phi_{k-1}\}$, and assume that $\Pi$ is a set of permutations on n
objects. We construct a permutation architecture for $\Pi$ with n busses and k pins per
chip. We name the chips and busses of the architecture by natural numbers, and the pins
by pairs of natural numbers. The architecture A = (C, B, P, CHIP, BUS, LABEL) is defined
as C = [n], B = [n], $P = [n] \times [k]$, CHIP(c, i) = c, LABEL(c, i) = i, and
$\mathrm{BUS}(c, i) = \phi_{\mathrm{LABEL}(c,i)}(\mathrm{CHIP}(c, i)) = \phi_i(c)$. That is, chip c is connected via its pin number i to bus
$\phi_i(c)$.

To see formally that this architecture uniformly realizes $\Pi$, let $\pi \in \Pi$ be a permutation,
and let $\phi_i, \phi_j \in \Phi$ be elements of the difference cover for $\Pi$ such that $\pi = \phi_j^{-1}\phi_i$. Define the
write function for $\pi$ as $\mathrm{WRITE}_\pi(c) = (c, i)$ and define the read function for $\pi$ as
$\mathrm{READ}_\pi(c) = (c, j)$. (Note that i and j are always in the range 0 through $k - 1$.) We now verify that
the five Conditions of Definition 3 are satisfied. Condition 1 holds since for any chip
$c \in C$ we have $\mathrm{CHIP}(\mathrm{WRITE}_\pi(c)) = \mathrm{CHIP}(c, i) = c$, and $\mathrm{CHIP}(\mathrm{READ}_\pi(c)) = \mathrm{CHIP}(c, j) = c$.
Condition 2 is satisfied since for any chip $c \in C$ we have

$$\begin{aligned}
\mathrm{BUS}(\mathrm{WRITE}_\pi(c)) &= \mathrm{BUS}(c, i) \\
&= \phi_i(c) \\
&= \phi_j \phi_j^{-1} \phi_i(c) \\
&= \phi_j(\pi(c)) \\
&= \mathrm{BUS}(\pi(c), j) \\
&= \mathrm{BUS}(\mathrm{READ}_\pi(\pi(c))).
\end{aligned}$$

Condition 3 holds because if $\mathrm{BUS}(\mathrm{WRITE}_\pi(c_1)) = \mathrm{BUS}(\mathrm{WRITE}_\pi(c_2))$ for any two chips
$c_1, c_2 \in C$ then we have $\phi_i(c_1) = \phi_i(c_2)$, which implies that $c_1 = c_2$, since $\phi_i$ is invertible.
Conditions 4 and 5 both hold since $\mathrm{LABEL}(\mathrm{WRITE}_\pi(c)) = i$ and $\mathrm{LABEL}(\mathrm{READ}_\pi(c)) = j$ for
all chips $c \in C$. We therefore conclude that the architecture A uniformly realizes $\Pi$. The
architecture is uniform, but Theorem 2 obviates the need to show this fact. ∎
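
The construction in the proof of Theorem 4 is short enough to run. The sketch below assumes permutations are tuples on range(n) with right-to-left composition; the names `build_architecture`, `find_pins`, and `check_realization` are hypothetical. The example cover for the six permutations of three objects (the identity plus the three transpositions) is chosen for this illustration and does not come from the paper.

```python
def compose(p1, p2):
    """Right-to-left composition: (p1 p2)(i) = p1(p2(i))."""
    return tuple(p1[p2[i]] for i in range(len(p2)))


def inverse(p):
    """Inverse permutation of the tuple p."""
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)


def build_architecture(Phi):
    """BUS(c, i) = phi_i(c): pin i of chip c is wired to bus Phi[i][c]."""
    n = len(Phi[0])
    return {(c, i): phi[c] for i, phi in enumerate(Phi) for c in range(n)}


def find_pins(Phi, pi):
    """Return (i, j) with pi = phi_j^{-1} phi_i, or None if Phi does not represent pi."""
    for i, phi_i in enumerate(Phi):
        for j, phi_j in enumerate(Phi):
            if compose(inverse(phi_j), phi_i) == tuple(pi):
                return i, j
    return None


def check_realization(Phi, pi):
    """Check Conditions 2 and 3 of Definition 3 for the wiring built from Phi."""
    pins = find_pins(Phi, pi)
    if pins is None:
        return False
    i, j = pins
    bus_of = build_architecture(Phi)
    n = len(pi)
    cond2 = all(bus_of[(c, i)] == bus_of[(pi[c], j)] for c in range(n))
    cond3 = len({bus_of[(c, i)] for c in range(n)}) == n
    return cond2 and cond3


# Illustrative cover for all permutations of three objects (an assumption of this sketch).
S3_COVER = [(0, 1, 2), (1, 0, 2), (2, 1, 0), (0, 2, 1)]
```

Conditions 1, 4, and 5 hold by construction here, because every chip uses its own pin i for writing and its own pin j for reading, so only Conditions 2 and 3 are tested.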

Given  a  difference  cover  of  small  cardinality,  Theorem  4  says  we  can  construct  a  uniform 
architecture  with  few  pins  per  chip.  In  fact,  the  reverse  is  true  as  well,  as  the  following 
theorem shows.


Theorem 5 Let $\Pi$ be a permutation set, and let A be a uniform architecture for $\Pi$ with
k pins per chip. Then $\Pi$ has a difference cover $\Phi$ such that $|\Phi| \le k$.

Proof. Given a uniform architecture A = (C, B, P, CHIP, BUS, LABEL) for the permutation
set $\Pi$, where k is the number of pins on each chip, we construct a difference cover for $\Pi$
as follows. Assume without loss of generality that C = B = [n] and range(LABEL) = [k].
For each pin number i, where $i = 0, \ldots, k-1$, we define $\phi_i$ by $\phi_i(c) = b$ if and only if
chip c is connected via its pin number i to bus b. We now define the difference cover to
be the set $\Phi = \{\phi_0, \phi_1, \ldots, \phi_{k-1}\}$. (The set $\Phi$ may have fewer than k elements, since some
permutations may be repeated among the $\phi_i$'s.)

To see that $\Phi$ is a difference cover for $\Pi$, consider any permutation $\pi \in \Pi$. Since A
uniformly realizes $\pi$, there exists a pair of pin labels (i, j) such that $\pi$ is realized by each
chip writing to its pin numbered i and reading from its pin numbered j. The labels i and j
satisfy $i = \mathrm{LABEL}(\mathrm{WRITE}_\pi(c))$ and $j = \mathrm{LABEL}(\mathrm{READ}_\pi(c))$ for all chips $c \in C$, as follows
from Conditions 4 and 5 of Definition 3. Conditions 1 and 3 of Definition 3 imply that
$\phi_i$ and $\phi_j$ are both permutations, and therefore there are $\phi_h, \phi_l \in \Phi$ such that $\phi_h = \phi_i$
and $\phi_l = \phi_j$. Finally, Condition 2 of Definition 3 implies that $\pi = \phi_j^{-1}\phi_i = \phi_l^{-1}\phi_h$, which
proves that $\Phi$ is indeed a difference cover for $\Pi$. ∎
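
The converse direction can be sketched the same way: read the mappings $\phi_i$ off a wiring and test whether they cover a target permutation set. The wiring format (a dictionary from (chip, pin) to bus) and the names `extract_cover`, `is_cover`, `shifter_wiring`, and `cyclic_shifts` are assumptions of this example, which reuses the cyclic shifter of Section 1 as input.

```python
def compose(p1, p2):
    """Right-to-left composition: (p1 p2)(i) = p1(p2(i))."""
    return tuple(p1[p2[i]] for i in range(len(p2)))


def inverse(p):
    """Inverse permutation of the tuple p."""
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)


def extract_cover(bus_of, n, k):
    """phi_i(c) is the bus reached from chip c through pin i; repeated mappings collapse."""
    return {tuple(bus_of[(c, i)] for c in range(n)) for i in range(k)}


def is_cover(Phi, Pi):
    """True if every permutation in Pi equals phi_j^{-1} phi_i for some phi_i, phi_j in Phi."""
    generated = {compose(inverse(pj), pi) for pj in Phi for pi in Phi}
    return set(Pi) <= generated


def shifter_wiring(D, n):
    """Wiring of the cyclic shifter: pin i of chip c goes to bus c + D[i] (mod n)."""
    return {(c, i): (c + d) % n for c in range(n) for i, d in enumerate(D)}


def cyclic_shifts(n):
    """All n cyclic shifts on range(n), as tuples."""
    return [tuple((c + s) % n for c in range(n)) for s in range(n)]
```

If two pin numbers are wired identically (for instance D = [0, 1, 3, 9, 9]), the extracted set has fewer elements than there are pins, which is the situation noted parenthetically in the proof.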

Theorems 4 and 5 show that uniform architectures and difference covers are very closely
related. Thus, when designing a uniform permutation architecture for a set of permutations,
it suffices to focus on the problem of constructing a good difference cover for that
set.

The  structure  of  a  permutation  set  can  be  helpful  in  obtaining  a  difference  cover  for  it. 
In  Sections  4  and  5,  we  investigate  the  construction  of  difference  covers  for  cyclic  groups 
of  permutations  and  for  groups  in  general.  Here,  we  examine  permutation  sets  formed  by 
Cartesian  products. 

Definition 6 Let $\Pi_1$ be a set of permutations from $X_1$ to $X_1$, and let $\Pi_2$ be a set of
permutations from $X_2$ to $X_2$. The Cartesian product $\Pi = \Pi_1 \times \Pi_2$ is the set of permutations
from $X_1 \times X_2$ to $X_1 \times X_2$ defined as $\Pi = \{(\pi_1, \pi_2) : \pi_1 \in \Pi_1, \pi_2 \in \Pi_2\}$. Operations on the
elements of $\Pi$ are performed componentwise.

The Cartesian product $\Pi_1 \times \Pi_2$ is isomorphic to the Cartesian product $\Pi_2 \times \Pi_1$. The
Cartesian product $\Pi = \Pi_1 \times \Pi_2$ is an abelian permutation set if and only if both $\Pi_1$ and
$\Pi_2$ are abelian permutation sets.

The next two lemmas provide bounds on the size of difference covers for Cartesian products
of permutation sets. (Similar lemmas hold for composition products of permutation
sets.)

Lemma 6 Let $\Pi_1$ be a permutation set on $n_1$ objects, and let $\Pi_2$ be a permutation set on
$n_2$ objects. Then the Cartesian product $\Pi = \Pi_1 \times \Pi_2$, which is a permutation set on $n_1 n_2$
objects, has a difference cover of size $|\Pi_1| + |\Pi_2|$.

Proof. Let $\Phi$ be the union of $\{(\pi_1^{-1}, I_{n_2}) : \pi_1 \in \Pi_1\}$ and $\{(I_{n_1}, \pi_2) : \pi_2 \in \Pi_2\}$. Each
permutation $\pi = (\pi_1, \pi_2) \in \Pi$ can be represented as $(\pi_1, \pi_2) = (\pi_1^{-1}, I_{n_2})^{-1} (I_{n_1}, \pi_2)$, where
both $(\pi_1^{-1}, I_{n_2})$ and $(I_{n_1}, \pi_2)$ are in $\Phi$. Thus $\Phi$ is a difference cover for $\Pi$, and the size of $\Phi$
is at most $|\Pi_1| + |\Pi_2|$. ∎

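Lemma 7 Let $\Pi_1$ be a permutation set on $n_1$ objects with a difference cover $\Phi_1$, and let $\Pi_2$
be a permutation set on $n_2$ objects with a difference cover $\Phi_2$. Then the Cartesian product
$\Phi = \Phi_1 \times \Phi_2$ is a difference cover for $\Pi = \Pi_1 \times \Pi_2$.

Proof. For each $\pi = (\pi_1, \pi_2) \in \Pi$, there exist $\phi_{i_1}, \phi_{j_1} \in \Phi_1$ such that $\pi_1 = \phi_{i_1}^{-1}\phi_{j_1}$, and
there exist $\phi_{i_2}, \phi_{j_2} \in \Phi_2$ such that $\pi_2 = \phi_{i_2}^{-1}\phi_{j_2}$. We then have
$(\pi_1, \pi_2) = (\phi_{i_1}^{-1}\phi_{j_1}, \phi_{i_2}^{-1}\phi_{j_2}) = (\phi_{i_1}, \phi_{i_2})^{-1}(\phi_{j_1}, \phi_{j_2})$,
where both $(\phi_{i_1}, \phi_{i_2})$ and $(\phi_{j_1}, \phi_{j_2})$ are in $\Phi_1 \times \Phi_2$, and hence $\Phi$
is a difference cover for $\Pi$. ∎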

To demonstrate both the use of difference covers and Lemma 7, we present in Figure 2 a
uniform permutation architecture due to C. Fiduccia [10] for realizing shifts in
a two-dimensional array. The architecture uniformly realizes the permutation set
$\Pi = \{I, N, E, S, W, NE, SE, NW, SW\}$ of eight compass directions plus the identity $I$. We
introduce two permutation sets $\Pi_1 = \{I, N, S\}$, $\Pi_2 = \{I, E, W\}$, and corresponding difference
covers $\Phi_1 = \{I, S\}$ and $\Phi_2 = \{I, E\}$. The Cartesian product $\Pi_1 \times \Pi_2$ is $\Pi$, and the set
of permutations $\Phi_1 \times \Phi_2 = \{S, SE, E, I\}$ is a difference cover for $\Pi$.
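
The compass example can be checked mechanically. The following is a minimal sketch, not part of the original construction: it assumes the array is an N-by-N torus, encodes each shift as a (row, column) displacement (so composing shifts adds displacements modulo N), and takes south and east to be the positive directions. The names `N`, `differences`, `cover`, and `compass` are illustrative choices.

```python
from itertools import product

N = 5  # assumed side length of the toroidal array; any N >= 3 keeps the nine shifts distinct

# A shift is a (row, column) displacement. Composition adds displacements
# modulo N, so the difference phi^{-1} phi' is the displacement phi' - phi.
I, S, E = (0, 0), (1, 0), (0, 1)

def differences(cover):
    """All shifts of the form phi^{-1} phi' with phi, phi' in the cover."""
    return {((b[0] - a[0]) % N, (b[1] - a[1]) % N) for a in cover for b in cover}

phi1 = [I, S]  # difference cover for Pi_1 = {I, N, S} (vertical shifts)
phi2 = [I, E]  # difference cover for Pi_2 = {I, E, W} (horizontal shifts)

# Lemma 7: pair every vertical cover element with every horizontal one.
cover = {(v[0], h[1]) for v, h in product(phi1, phi2)}

# Pi = Pi_1 x Pi_2: displacements of -1, 0, or +1 in each coordinate.
compass = {(dr % N, dc % N) for dr in (-1, 0, 1) for dc in (-1, 0, 1)}

print(sorted(cover))                  # the four shifts I, E, S, SE
print(compass <= differences(cover))  # True when all nine shifts are covered
```

Four pins per chip therefore suffice for nine permutations in this encoding, in agreement with the cover {S, SE, E, I}.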

We conclude this section by defining the notion of a substring cover for a permutation
set $\Pi$, which is equivalent to the notion of a difference cover. (A similar notion for difference
sets is well known in the literature [6, 26].)

Definition 7 An ordered list $\Sigma = (\sigma_0, \sigma_1, \ldots, \sigma_{k-1})$ of permutations is a substring cover
for a permutation set $\Pi$ if

1. $\sigma_0 \sigma_1 \cdots \sigma_{k-1} = I$, and

2. for all $\pi \in \Pi$, there exist $0 \le i, j \le k-1$ such that $\pi = \sigma_i \sigma_{i+1} \cdots \sigma_j$, where the
arithmetic in the indices is performed modulo $k$.

Figure 2: A uniform architecture due to C. Fiduccia [10] based on the difference cover $\{S, SE, E, I\}$
for the permutation set $\Pi = \{I, N, E, S, W, NE, SE, NW, SW\}$.

The substring cover $\Sigma$ is a list of permutations such that all the permutations in $\Pi$ can
be represented as a composition of a substring of permutations of $\Sigma$. The following two
theorems show that the notions of a substring cover and difference cover are equivalent.

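Theorem 8 Let $\Pi$ be a permutation set on $n$ elements, and let $\Sigma$ be a $k$-element substring
cover for $\Pi$. Then $\Pi$ has a difference cover $\Phi$ with at most $k$ elements.

Proof. Given a $k$-element substring cover $\Sigma = (\sigma_0, \sigma_1, \ldots, \sigma_{k-1})$ for $\Pi$, a difference cover
$\Phi$ with at most $k$ elements can be constructed. For each $0 \le i \le k-1$ we define
$\phi_i = \sigma_0 \sigma_1 \cdots \sigma_i$. If a permutation $\pi$ can be represented as $\pi = \sigma_i \sigma_{i+1} \cdots \sigma_j$, then
$\pi = \phi_{i-1}^{-1}\phi_j$, where the index $i - 1$ is taken modulo $k$; this is consistent because
$\phi_{k-1} = I$ by Condition 1 of Definition 7.
By construction, the difference cover $\Phi$ has at most $k$ elements. ∎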

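Theorem 9 Let $\Pi$ be a permutation set on $n$ elements, and let $\Phi$ be a $k$-element difference
cover for $\Pi$. Then $\Pi$ has a substring cover $\Sigma$ with $k$ elements.

Proof. Given a $k$-element difference cover $\Phi = \{\phi_0, \phi_1, \ldots, \phi_{k-1}\}$ for $\Pi$, we build a substring
cover $\Sigma$ for $\Pi$ by defining $\sigma_i = \phi_{i-1}^{-1}\phi_i$ for all $0 \le i \le k-1$, with indices taken
modulo $k$. The product $\sigma_0 \sigma_1 \cdots \sigma_{k-1}$
yields the identity permutation. For each $\pi \in \Pi$, if $\pi = \phi_i^{-1}\phi_j$ then $\pi = \sigma_{i+1}\sigma_{i+2} \cdots \sigma_j$.
Therefore $\Sigma$ is a substring cover for $\Pi$ with $k$ elements. ∎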

Referring back to the example of the eight compass directions, we present a substring
cover for the permutation set $\Pi = \{I, N, E, S, W, NE, SE, NW, SW\}$. The substring cover
$\Sigma = (S, E, N, W)$ is constructed from the difference cover $\Phi = \{S, SE, E, I\}$ that was used
in the architecture of Figure 2. Each of the eight compass directions can be realized as a
substring of the list $\Sigma = (S, E, N, W)$.
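
The conversion from a difference cover to a substring cover can be illustrated on the same example. This is a sketch under stated assumptions: shifts are encoded as (row, column) displacements on a torus of side `SIDE`, with south and east positive, and the helper names `add`, `sub`, `substring_cover`, and `substrings` are hypothetical. It applies the rule that each list entry is the difference of two consecutive cover elements, with indices taken cyclically.

```python
SIDE = 5  # assumed side length of the toroidal array

def add(a, b):
    """Compose two shifts (displacements add modulo SIDE)."""
    return ((a[0] + b[0]) % SIDE, (a[1] + b[1]) % SIDE)

def sub(a, b):
    """The shift b^{-1} a, i.e. displacement a - b modulo SIDE."""
    return ((a[0] - b[0]) % SIDE, (a[1] - b[1]) % SIDE)

I, S, E = (0, 0), (1, 0), (0, 1)
N, W = sub(I, S), sub(I, E)   # north and west are the inverses of S and E
SE = add(S, E)

def substring_cover(phi):
    """sigma_i = phi_{i-1}^{-1} phi_i, with the index i - 1 taken modulo k."""
    return [sub(phi[i], phi[i - 1]) for i in range(len(phi))]

def substrings(sigma):
    """Compositions sigma_i sigma_{i+1} ... sigma_j over all cyclic substrings."""
    k, found = len(sigma), set()
    for i in range(k):
        total = I
        for step in range(k):
            total = add(total, sigma[(i + step) % k])
            found.add(total)
    return found

sigma = substring_cover([S, SE, E, I])
print(sigma == [S, E, N, W])   # True: the list (S, E, N, W)
print(len(substrings(sigma)))  # 9: the identity and the eight compass directions
```

Applying `substring_cover` to the ordered cover (N, E, I) yields (N, SE, W) in the same encoding.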


Figure 3: A uniform architecture due to C. Feynman [15] based on the difference cover $\{N, E, I\}$
for the permutation set $\Pi = \{I, N, E, S, W\}$.

As another example, consider the permutation set $\Pi = \{I, N, E, S, W\}$ of the shifts in
a 2-dimensional array corresponding to the four compass directions. This permutation set
has a difference cover $\Phi = \{N, E, I\}$ and a corresponding substring cover $\Sigma = (N, SE, W)$.
Consequently, there is a uniform architecture for realizing the four compass directions
with three pins per chip, as has been observed by C. Feynman [15, pp. 437-438]. Figure 3
presents a uniform architecture based on the difference cover $\Phi = \{N, E, I\}$ for the
permutation set $\Pi = \{I, N, E, S, W\}$.


4  Cyclic  shifters 

This section describes uniform architectures for realizing cyclic shifts among $n$ chips in
one clock tick. We first present a difference cover of size $O(\sqrt{n})$ for the set of all $n$
cyclic shifts on $n$ elements, and we give an area-efficient layout for the corresponding
permutation architecture suitable for implementation as a printed-circuit board. When $n$
can be expressed as $n = q^2 + q + 1$, where $q$ is a power of a prime, we improve the bound
on the size of a difference cover for all cyclic shifts on $n$ elements to the optimal value of
$\lceil\sqrt{n}\rceil$. Finally, we prove that for any cyclic shifter that operates in one clock tick (even a
nonuniform one), the average number of pins per chip is at least $\lceil\sqrt{n}\rceil$.
The  first  permutation  architecture  for  cyclic  shifters  that  we  present  is  based  on  the 
construction  in  the  following  simple  theorem. 

Theorem 10 The set of $n$ cyclic shifts on $n$ elements has a difference cover of size at
most $2\lceil\sqrt{n}\rceil - 1$.

Proof. Since the set of $n$ cyclic shifts on $n$ elements forms a group, and since this group is
isomorphic to the group $Z_n$, we shall construct a difference cover $D$ for $Z_n$. For convenience,
let $m = \lceil\sqrt{n}\rceil$. Define two sets $A = \{0, 1, \ldots, m-1\}$ and $B = \{0, m, 2m, \ldots, (m-1)m\}$,
and let the difference cover $D$ be defined by $D = A \cup B$. Each element $s \in Z_n$ with
$s \le (m-1)m$ can be realized as $s = b - a \pmod{n}$, where $a \in A$ and $b \in B$, by taking
$b = \lceil s/m \rceil \cdot m$ and $a = b - s$, as can be verified. Any remaining element $s > (m-1)m$
satisfies $n - s < m$ because $n \le m^2$, and is realized as $s = 0 - (n - s) \pmod{n}$ with $0 \in B$
and $n - s \in A$. The size of the difference cover $D$ is at most $2m - 1 = 2\lceil\sqrt{n}\rceil - 1$,
since the element 0 occurs in both $A$ and $B$. ∎
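
The construction in this proof is easy to check by brute force. The sketch below is illustrative only; the function names `cyclic_difference_cover` and `is_difference_cover` are not from the paper. It builds the union of the two sets for a given n and tests that every residue modulo n arises as a difference of two cover elements.

```python
import math

def cyclic_difference_cover(n):
    """Union of A = {0, ..., m-1} and B = {0, m, ..., (m-1)m} reduced mod n."""
    m = math.isqrt(n - 1) + 1          # m = ceil(sqrt(n)) for n >= 1
    set_a = set(range(m))
    set_b = {(j * m) % n for j in range(m)}
    return set_a | set_b

def is_difference_cover(cover, n):
    """True if every element of Z_n equals y - x (mod n) for some x, y in the cover."""
    return {(y - x) % n for x in cover for y in cover} == set(range(n))

print(sorted(cyclic_difference_cover(16)))  # [0, 1, 2, 3, 4, 8, 12]
print(is_difference_cover(cyclic_difference_cover(16), 16))  # True
```

For n = 16 the cover has seven elements, which matches the seven pins per chip in the layout of Figure 4.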

The difference cover constructed in the proof of Theorem 10 corresponds to an architecture
with a regular, area-efficient layout, as shown in Figure 4. The $n$ chips of the
architecture are laid out in an array consisting of $m = \sqrt{n}$ rows, each containing $\sqrt{n}$
chips. (For simplicity, we assume that $n$ is a square.) Each chip has pins $0, 1, \ldots, m-1$
on the top side, and pins $m, m+1, \ldots, 2m-2$ on the left side. Each bus consists of
one vertical segment and one or two horizontal segments. Each wiring channel consists of
$m = \sqrt{n}$ tracks, where each track is used to lay out segments of busses. When $n$ is not
a square, a cyclic shifter on $n$ chips can be laid out in a similar fashion, with each wiring
channel having at most $2\lceil\sqrt{n}\rceil$ tracks. The side of the layout is therefore $O(n)$, since
there are $\lceil\sqrt{n}\rceil$ chips and $\lceil\sqrt{n}\rceil$ wiring channels along the side. The area of the layout is
$O(n^2)$, which is asymptotically optimal since any architecture that can realize any of the
cyclic-shift permutations in one clock tick requires area $\Omega(n^2)$ [30, p. 56].

Remark. The bound of $2\lceil\sqrt{n}\rceil - 1$ pins per chip can be improved to $(\sqrt{2} + o(1))\sqrt{n}$.
See Section 8.

Occasionally, it is desirable to implement a subset of the cyclic shifts on $n$ elements. The
following corollary to Theorem 10 shows that when the shift amounts form an arithmetic
sequence, a small difference cover exists.

Corollary 11 Let $a$, $b$, and $p$ be integers modulo $n$. For each $r \in [p]$, define $\pi_r$ to be the
permutation on $[n]$ that maps each $c \in [n]$ to $c + a + rb \pmod{n}$. Then the permutation set
$\{\pi_r : r \in [p]\}$ has a difference cover of size $2\lceil\sqrt{p}\rceil$.

Proof. As in the proof of Theorem 10, we construct two sets $A$ and $B$ whose union is
the desired difference cover. The sets are $A = \{0, -b, -2b, \ldots, -(m-1)b\}$ and
$B = \{a, a + mb, a + 2mb, \ldots, a + (m-1)mb\}$, where $m = \lceil\sqrt{p}\rceil$. Each $r \in [p]$ can be
written as $r = jm + i$ with $i, j \in [m]$, so that $a + rb = (a + jmb) - (-ib) \pmod{n}$. ∎
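
As a sanity check on Corollary 11, the following sketch builds a two-set cover for an arithmetic sequence of shift amounts and verifies it exhaustively for small parameters. It is an illustration under one stated assumption: the first set is taken to hold the multiples 0, -b, ..., -(m-1)b, so that subtracting one of its elements adds a multiple of b between 0 and (m-1)b. The function names are hypothetical.

```python
import math

def arithmetic_shift_cover(n, a, b, p):
    """Cover for the shift amounts a + r*b (mod n), r = 0, ..., p-1."""
    m = math.isqrt(p - 1) + 1                        # m = ceil(sqrt(p)), p >= 1
    set_a = {(-i * b) % n for i in range(m)}         # 0, -b, ..., -(m-1)b
    set_b = {(a + j * m * b) % n for j in range(m)}  # a, a+mb, ..., a+(m-1)mb
    return set_a | set_b

def realizes(cover, n, a, b, p):
    """True if every shift amount a + r*b is a difference y - x of cover elements."""
    diffs = {(y - x) % n for x in cover for y in cover}
    return all((a + r * b) % n in diffs for r in range(p))

cover = arithmetic_shift_cover(101, 7, 3, 20)
print(len(cover) <= 2 * 5, realizes(cover, 101, 7, 3, 20))  # True True
```

With p = 20 the sketch uses m = 5, so at most ten cover elements realize the twenty shifts.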

Returning to the problem of implementing all $n$ cyclic shifts on $n$ elements, the following
theorem demonstrates that for certain values of $n$, the optimal bound of $\lceil\sqrt{n}\rceil$ can be
obtained.

Figure 4: A layout for a cyclic shifter with $n = 16$ chips. Each chip and each bus has 7 pins.
Each bus is constructed of one vertical segment and either one or two horizontal segments.

Theorem 12 The set of $n$ cyclic shifts on $n$ elements has a difference cover of size $\lceil\sqrt{n}\rceil$
if $n = q^2 + q + 1$, where $q$ is a power of a prime.

Proof. As in the proof of Theorem 10, the problem is equivalent to that of constructing
a difference cover $D$ for $Z_n$. When $n$ is the size of a projective plane ($n = q^2 + q + 1$,
where $q$ is a power of a prime), this problem is equivalent to the problem of constructing a
difference set. The difference set we give is due to Singer; a proof of its correctness is given
in Hall [18, p. 129]. Let $x$ be a primitive root of the Galois field $GF(q^3)$, and let $F(y)$ be
an irreducible cubic polynomial over the Galois field $GF(q)$ that has $x$ as a root. We construct
a difference cover $D$ for $Z_n$ from the set $[n]$ by choosing those $i \in [n]$ such that the power $x^i$ can be
written in the form $x^i = ax + b \pmod{F(x)}$ for some $a, b \in GF(q)$. ∎
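
Singer's construction can be sketched in code for the special case where q is a prime, so that arithmetic in GF(q) is ordinary arithmetic modulo q. This is an illustrative sketch, not the paper's procedure: it finds a suitable cubic by brute-force search (accepting one for which the class of x has multiplicative order q^3 - 1), and the function name `singer_difference_set` is hypothetical. Prime powers that are not primes would need a full finite-field implementation and are not handled.

```python
from itertools import product

def singer_difference_set(q):
    """Return a (q+1)-element difference set modulo q*q + q + 1, for prime q."""
    n = q * q + q + 1
    order = q ** 3 - 1
    # Try monic cubics F(x) = x^3 + f2 x^2 + f1 x + f0 over GF(q), f0 != 0.
    for f0, f1, f2 in product(range(1, q), range(q), range(q)):
        powers = []
        elem = (1, 0, 0)  # coefficients (c0, c1, c2) of c0 + c1 x + c2 x^2; this is x^0
        for _ in range(order):
            powers.append(elem)
            c0, c1, c2 = elem
            # Multiply by x and reduce using x^3 = -(f2 x^2 + f1 x + f0).
            elem = ((-c2 * f0) % q, (c0 - c2 * f1) % q, (c1 - c2 * f2) % q)
        # x is primitive iff x^order = 1 and no smaller positive power equals 1.
        if elem == (1, 0, 0) and powers.count((1, 0, 0)) == 1:
            # Keep exponents i (mod n) for which x^i has the form a x + b.
            return sorted({i % n for i, (_, _, c2) in enumerate(powers) if c2 == 0})
    raise ValueError("no primitive cubic found; q is assumed to be prime")

d = singer_difference_set(3)
print(len(d))  # 4 elements for n = 13
```

For q = 3 the result has q + 1 = 4 elements modulo 13, the size of the projective plane used in Figure 1.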

The construction of a uniform architecture based on a projective plane can be interpreted
as follows. The $n$ points of the projective plane correspond to the $n$ chips and the
$n$ lines of the projective plane correspond to the $n$ busses. Each line contains $q + 1$ points,
which means that each bus is connected to $q + 1$ chips. Each point is incident on $q + 1$
lines, which means that each chip is connected to $q + 1$ different busses through its $q + 1$
pins. For example, Figure 1 demonstrates a uniform architecture based on the projective
plane of size 13.

Theorems similar to Theorem 10 (but without application to architecture) appear in
the combinatorics literature; see, for example, [22]. Bus connection networks based on
projective planes have also been studied by Bermond, Bond, and Saclé [4] and by Mickunas
[25], who observed that projective planes can be used to construct hypergraphs of
diameter one.

Uniform architectures for cyclic shifters based on projective planes achieve the minimal
number of pins per chip among all uniform cyclic shifters. We now prove a lower bound of
$\lceil\sqrt{n}\rceil$ on the average number of pins per chip for any permutation architecture that realizes
all the cyclic shifts. This lower bound applies to all permutation architectures, including
nonuniform ones, and shows that uniform cyclic shifters based on projective planes are
optimal among all cyclic shifters that operate in a single clock tick.

Theorem 13 Let $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ be a permutation architecture for the
$n$ cyclic shifts on $n$ chips. Then the average number of pins per chip is at least $\lceil\sqrt{n}\rceil$.

Proof. The average number of pins per chip is $|P|/n$. We shall prove that $|P| \ge n\lceil\sqrt{n}\rceil$,
which implies the theorem. We adopt the following conventions for notational convenience:

1. The set of busses is $B = \{b_0, b_1, \ldots, b_{m-1}\}$. We denote by $k_i$ the number of pins
connected to bus $b_i$, that is, $k_i = |\{p \in P : \mathrm{BUS}(p) = b_i\}|$.

2. The busses that have at least $\lceil\sqrt{n}\rceil$ pins each are indexed first, that is, if there are
$r$ busses with at least $\lceil\sqrt{n}\rceil$ pins each, then $k_i \ge \lceil\sqrt{n}\rceil$ for $i = 0, \ldots, r-1$ and
$k_i < \lceil\sqrt{n}\rceil$ for $i = r, \ldots, m-1$.

The thrust of the proof is to count the number of distinct data transfers when the
architecture realizes each of the $n - 1$ nontrivial shifts in turn. (The identity permutation
is a trivial shift.) Each chip can be mapped to each other chip by one of the cyclic shifts,
i.e., the cyclic shifts form a transitive group of permutations. Considering only the $n - 1$
nontrivial shifts, there are exactly $n(n-1)$ distinct data transfers that must be implemented
through interconnections in the architecture.

We compute an upper bound on the number of distinct data transfers that the busses
can implement. Each of the first $r$ busses $b_0, \ldots, b_{r-1}$ can be employed to realize at most
one distinct data transfer in each of the $n - 1$ nontrivial shifts. Thus, at most $r(n-1)$
distinct data transfers can be carried out by the first $r$ busses. Any other bus $b_i$, where
$r \le i \le m-1$, can realize at most $k_i(k_i - 1)$ distinct nontrivial data transfers, since it has
only $k_i$ pins connected to it. Thus, the total number of distinct data transfers that the
busses can realize is

$$r(n-1) + \sum_{i=r}^{m-1} k_i(k_i - 1),$$

which must be at least $n(n-1)$ if all nontrivial shifts are to be realized. Hence, we
have

$$\sum_{i=r}^{m-1} k_i(k_i - 1) \ge (n-r)(n-1).$$

We can use this inequality to bound the number of pins on all busses with fewer than
$\lceil\sqrt{n}\rceil$ pins. We have $k_i - 1 \le \lceil\sqrt{n}\rceil - 2$ for $i = r, \ldots, m-1$, and thus

$$\begin{aligned}
\sum_{i=r}^{m-1} k_i &\ge \frac{1}{\lceil\sqrt{n}\rceil - 2} \sum_{i=r}^{m-1} k_i(k_i - 1) \\
&\ge \frac{(n-r)(n-1)}{\lceil\sqrt{n}\rceil - 2} \\
&\ge (n-r)\lceil\sqrt{n}\rceil,
\end{aligned}$$

where the last step uses $\lceil\sqrt{n}\rceil(\lceil\sqrt{n}\rceil - 2) = (\lceil\sqrt{n}\rceil - 1)^2 - 1 < n - 1$.

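We now bound the total number of pins in the architecture from below. We have

$$\begin{aligned}
|P| &= \sum_{i=0}^{m-1} k_i \\
&= \sum_{i=0}^{r-1} k_i + \sum_{i=r}^{m-1} k_i \\
&\ge r\lceil\sqrt{n}\rceil + (n-r)\lceil\sqrt{n}\rceil \\
&= n\lceil\sqrt{n}\rceil,
\end{aligned}$$

which proves the theorem. ∎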


5  Difference  covers  for  groups 

In this section we show that small difference covers for abelian and nonabelian permutation
groups exist. Specifically, for any permutation group $\Pi$ with $p$ elements, we show how to
construct a difference cover with $O(\sqrt{p \lg p})$ elements. In the case where $\Pi$ is abelian, we
apply the decomposition theorem for finite abelian groups and the results for cyclic shifters
in Section 4 to sharpen this bound to $O(\sqrt{p})$, which is optimal to within a constant factor.

As  the  first  result  of  this  section,  we  give  a  method  for  constructing  a  small  difference 
cover  for  an  arbitrary  permutation  group. 

Theorem 14 Let $\Pi$ be an arbitrary group with $p$ elements. Then $\Pi$ has a difference cover
$\Phi$ of size at most $\sqrt{2p \ln p} + 1$.

Proof. We construct a difference cover incrementally starting with a partial difference
cover $\Phi_1 = \{I\}$. At each step of the construction, we select an element $\phi_{i+1} \in \Pi$ such
that $|\Phi_i^{-1}(\Phi_i \cup \{\phi_{i+1}\})|$ maximizes $|\Phi_i^{-1}(\Phi_i \cup \{\pi\})|$ over all $\pi \in \Pi$. We then define the new
partial difference cover as $\Phi_{i+1} = \Phi_i \cup \{\phi_{i+1}\}$.

The analysis of this construction is in three parts. We first determine a lower bound on
the number of elements of $\Pi$ that are not covered by the partial difference cover $\Phi_i$ but are
covered by $\Phi_{i+1}$. We then develop a recurrence to upper bound the number of elements
of the group $\Pi$ that are not covered at the $i$th step. Finally, we solve the recurrence to
determine that the number $k$ of iterations needed to cover all elements in $\Pi$ is at most
$\sqrt{2p \ln p} + 1$.

We first determine how many new elements of $\Pi$ are covered when $\Phi_i$ is augmented with
$\phi_{i+1}$ to produce $\Phi_{i+1}$, for $i \ge 1$. Let the set $\Delta_i$ be the set of elements that are not covered
by the partial difference cover $\Phi_i$, which can be defined as $\Delta_i = \Pi - \Phi_i^{-1}\Phi_i$. Consider
triples of the form $(\phi, \delta, \pi)$ such that $\phi \in \Phi_i$, $\delta \in \Delta_i$, $\pi \in \Pi$, and $\phi\delta = \pi$. Observe that for
any fixed $\pi \in \Pi$ and $\delta \in \Delta_i$, there is at most one triple of the form $(\phi, \delta, \pi)$ in the set of
triples, namely $(\pi\delta^{-1}, \delta, \pi)$ when $\pi\delta^{-1} \in \Phi_i$. For a fixed $\pi$, the number of triples $(\phi, \delta, \pi)$
in the set of triples is a lower bound on the number of elements covered by $\Phi_i \cup \{\pi\}$ but
not by $\Phi_i$, since we have $\delta = \phi^{-1}\pi$ and $\delta \in \Delta_i = \Pi - \Phi_i^{-1}\Phi_i$. For each $\phi \in \Phi_i$ and $\delta \in \Delta_i$,
there is exactly one triple in the set of triples, and thus there are exactly $|\Phi_i| \cdot |\Delta_i|$ triples.
Since there are at most $|\Pi|$ distinct permutations appearing as the third coordinate of a
triple, the permutation $\phi_{i+1}$ that appears most often must appear at least $|\Phi_i| \cdot |\Delta_i| / |\Pi|$
times, and hence at least this many elements are covered by $\Phi_{i+1}$ that are not covered by
$\Phi_i$.

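We can now bound the number of elements not covered by $\Phi_{i+1}$ in terms of the number
of elements not covered by $\Phi_i$ by

$$|\Delta_{i+1}| \le |\Delta_i| - \frac{|\Phi_i| \cdot |\Delta_i|}{|\Pi|}.$$

Since $|\Phi_i| = i$ and $|\Pi| = p$, this gives $|\Delta_{i+1}| \le |\Delta_i| (1 - i/p)$.
When we obtain $|\Delta_k| < 1$ for some $k$, the partial difference cover $\Phi_k$ is a difference cover
for $\Pi$ because $\Delta_k$ is empty. Thus, $\Phi_k$ is a difference cover when

$$p \prod_{j=1}^{k-1} \left(1 - \frac{j}{p}\right) < 1,$$

or equivalently, when

$$\ln p + \sum_{j=1}^{k-1} \ln\left(1 - \frac{j}{p}\right) < 0.$$

Using the inequality $\ln(1 + x) \le x$, we have

$$\begin{aligned}
\ln p + \sum_{j=1}^{k-1} \ln\left(1 - \frac{j}{p}\right) &\le \ln p - \sum_{j=1}^{k-1} \frac{j}{p} \\
&= \ln p - \frac{1}{p} \sum_{j=1}^{k-1} j \\
&< \ln p - \frac{(k-1)^2}{2p} \\
&\le 0
\end{aligned}$$

whenever $k \ge \sqrt{2p \ln p} + 1$. Thus, $\Phi_k$ is a difference cover when $k \ge \sqrt{2p \ln p} + 1$. ∎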

This proof of Theorem 14 provides a construction which can be implemented as a
deterministic, polynomial-time algorithm with $O(p^2 \lg p)$ algebraic steps. We could also
have proved the theorem by relying on the result of Babai and Erdős [2] that any group
has a small set of generators, but this method would have produced only an existential
(nonconstructive) result.
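
The greedy construction from the proof can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: group elements are permutations stored as tuples, the example group is the symmetric group on four points (p = 24), ties in the greedy choice are broken by list order, and the helper names are hypothetical. No attempt is made to meet the step count quoted above.

```python
import math
from itertools import permutations

def compose(f, g):
    """The product f g, applying g first: (f g)(x) = f(g(x))."""
    return tuple(f[g[x]] for x in range(len(g)))

def inverse(f):
    inv = [0] * len(f)
    for x, y in enumerate(f):
        inv[y] = x
    return tuple(inv)

def greedy_difference_cover(group):
    """Grow a cover from {I}, each time adding the element that covers the most new differences."""
    identity = tuple(range(len(group[0])))
    cover = [identity]
    covered = {identity}
    while len(covered) < len(group):
        def gain(g):
            # Number of uncovered elements of the form phi^{-1} g with phi in the cover.
            return len({compose(inverse(phi), g) for phi in cover} - covered)
        cover.append(max(group, key=gain))
        covered |= {compose(inverse(x), y) for x in cover for y in cover}
    return cover

s4 = list(permutations(range(4)))          # assumed example group with p = 24
cover = greedy_difference_cover(s4)
bound = math.sqrt(2 * len(s4) * math.log(len(s4))) + 1
print(len(cover), round(bound, 2))         # cover size next to the bound of Theorem 14
```

The loop always terminates, because the cover contains the identity and adding any uncovered element covers at least that element.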

We have shown that there are difference covers of size $O(\sqrt{p \lg p})$ for general permutation
groups with $p$ elements. We now show that if the group is abelian, difference covers
of size $O(\sqrt{p})$ exist.


Theorem 15 For any abelian group $\Pi$ with $p$ elements, there exists a difference cover $\Phi$
of size at most $3\sqrt{p}$.


Proof. Assume without loss of generality that $p > 1$. By the decomposition theorem for
finite abelian groups [23, p. 133], any abelian group $\Pi$ is isomorphic to a cross product of
cyclic groups

$$\Pi \cong Z_{p_1} \times Z_{p_2} \times \cdots \times Z_{p_k},$$

where $p_1 p_2 \cdots p_k = p$ and each $p_j \ge 2$. Let $i$ be the unique index such that $p_1 p_2 \cdots p_{i-1} \le \sqrt{p}$
and $p_{i+1} p_{i+2} \cdots p_k < \sqrt{p}$, and let $m = \lceil \sqrt{p} / (p_1 p_2 \cdots p_{i-1}) \rceil$. Using the argument of
Theorem 10, we first construct a difference cover for $Z_{p_i}$ from the union of two sets $A_i$ and
$B_i$, where $|A_i| \le m$ and $|B_i| \le \lfloor p_i/m \rfloor$, such that each element of $Z_{p_i}$ can be expressed in
the form $b - a \pmod{p_i}$ or $a - b \pmod{p_i}$, where $a \in A_i$ and $b \in B_i$.

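We now construct a difference cover for $\Pi \cong Z_{p_1} \times Z_{p_2} \times \cdots \times Z_{p_k}$ from the union of
two sets $A$ and $B$, where

$$A \cong Z_{p_1} \times Z_{p_2} \times \cdots \times Z_{p_{i-1}} \times A_i,$$

and

$$B \cong B_i \times Z_{p_{i+1}} \times Z_{p_{i+2}} \times \cdots \times Z_{p_k}.$$

That $A \cup B$ is a difference cover for $\Pi$ follows from essentially the same argument as is
used in Lemma 7.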

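The size of the difference cover $A \cup B$ is at most $|A| + |B|$. The size of $A$ is

$$\begin{aligned}
|A| &= p_1 p_2 \cdots p_{i-1} |A_i| \\
&\le p_1 p_2 \cdots p_{i-1} m \\
&= p_1 p_2 \cdots p_{i-1} \lceil \sqrt{p} / (p_1 p_2 \cdots p_{i-1}) \rceil \\
&< \sqrt{p} + p_1 p_2 \cdots p_{i-1} \\
&\le 2\sqrt{p}.
\end{aligned}$$

Similarly, the size of $B$ is

$$\begin{aligned}
|B| &= |B_i| p_{i+1} p_{i+2} \cdots p_k \\
&\le \lfloor p_i/m \rfloor p_{i+1} p_{i+2} \cdots p_k \\
&\le \left( p_i / \lceil \sqrt{p} / (p_1 p_2 \cdots p_{i-1}) \rceil \right) p_{i+1} p_{i+2} \cdots p_k \\
&\le \left( p_1 p_2 \cdots p_i / \sqrt{p} \right) p_{i+1} p_{i+2} \cdots p_k \\
&= \sqrt{p}.
\end{aligned}$$

Consequently, the size of the difference cover for $\Pi$ is at most $3\sqrt{p}$. ∎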


6  Multiple  clock  ticks 

In this section we discuss uniform permutation architectures that realize permutations in
several clock ticks. By using more than one clock tick, further savings in the number of
pins per chip can be obtained. We generalize the notion of a difference cover to handle
multiple clock ticks, and describe a cyclic shifter on $n$ chips with only $O(n^{1/2t})$ pins per
chip that operates in $t$ ticks.

We  first  generalize  the  notion  of  a  difference  cover  to  handle  realization  of  permutations 
in  t  >  1  clock  ticks. 

Definition 8 A $t$-difference cover for a permutation set $\Pi$ is a set $\Phi$ of permutations such
that $(\Phi^{-1}\Phi)^t \supseteq \Pi$.

Using a $t$-difference cover $\Phi$ for the permutation set $\Pi$, any permutation $\pi \in \Pi$ can be
expressed as the composition of $t$ differences of permutations from $\Phi$. The next lemma
relates $t$-difference covers to permutation architectures that realize permutations in $t$ clock
ticks.

Lemma 16 Let $\Phi$ be a $t$-difference cover with $k$ elements for a permutation set $\Pi$. Then
there is a permutation architecture with $k$ pins per chip that uniformly realizes $\Pi$ in $t$ clock
ticks.

Proof. We define the permutation set $\Sigma = \Phi^{-1}\Phi$. Let $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$
be the permutation architecture, based on the difference cover $\Phi$, that uniformly realizes
$\Sigma$. Hence, the permutation architecture $A$ can uniformly realize any $\sigma \in \Sigma$ in one clock
tick. Each permutation $\pi \in \Pi$ can be expressed as $\pi = \sigma_{t-1}\sigma_{t-2} \cdots \sigma_0$, where $\sigma_i \in \Sigma$ for
$0 \le i \le t-1$, since we have $\Sigma^t = (\Phi^{-1}\Phi)^t \supseteq \Pi$. In order to realize $\pi$ in $t$ clock ticks, the
permutation architecture $A$ uniformly realizes $\sigma_i$ in clock tick $i$ for $0 \le i \le t-1$. ∎

Lemma 16 claims that the problem of uniformly realizing a permutation set $\Pi$ in $t$
clock ticks can be reduced to finding a permutation set $\Sigma$ such that $\Sigma^t \supseteq \Pi$, and then
finding a difference cover for $\Sigma$. The great advantage of using more than one clock tick is
in the further savings in the number of pins per chip. The following theorem, for example,
describes a construction of a $t$-difference cover of size $O(n^{1/2t})$ for the set of cyclic shifts
on $n$ objects. This result can be used to build a uniform architecture on $n$ chips with only
$O(n^{1/2t})$ pins per chip that can realize any cyclic shift on the $n$ chips in $t$ clock ticks.

Theorem 17 For any $n \ge 1$ and $t \ge 1$, the permutation set of all the $n$ cyclic shifts on $n$
objects has a $t$-difference cover of size $O(n^{1/2t})$.

Proof. For the purpose of the proof, we denote the permutation set of all the $n$ cyclic
shifts on $n$ objects by $\Pi_n$. (Recall that $\Pi_n \cong Z_n$.) We first treat the case for those
$n$ such that there exists an integer $m$ satisfying $n^{1/t} \le m \le 4n^{1/t}$ and $\gcd(m, n) = 1$. We
then use this case to extend the proof to all values of $n$.

Since $\gcd(m, n) = 1$, there exists an $m^{-1} \in Z_n$ such that $m \cdot m^{-1} \equiv 1 \pmod{n}$. For
each $r \in [m]$, define the permutation $\sigma_r : [n] \to [n]$ as $\sigma_r(c) = m^{-1}(c + r) \pmod{n}$, and
define the permutation $\sigma'_r : [n] \to [n]$ as $\sigma'_r(c) = m^{t-1}(c + r) \pmod{n}$. Next define the
permutation set $\Sigma = \{\sigma_r\} \cup \{\sigma'_r\}$. The set $\{\sigma_r\}$ is an arithmetic sequence of cyclic shifts
on $n$ elements (as in Corollary 11) followed by the fixed permutation corresponding to
multiplication by $m^{-1}$, and thus $\{\sigma_r\}$ has a difference cover of size $O(\sqrt{m})$. Similarly, the
set $\{\sigma'_r\}$ has a difference cover of size $O(\sqrt{m})$. Combining the two difference covers for
$\{\sigma_r\}$ and $\{\sigma'_r\}$, we get a difference cover $\Phi$ of size $O(\sqrt{m}) = O(n^{1/2t})$ for $\Sigma$.

We now show the inclusion $\Sigma^t \supseteq \Pi_n$. Let $\pi \in \Pi_n$ be a cyclic shift by
$s$. We express the shift amount $s \in [n]$ as $s = s_0 + s_1 m + \cdots + s_{t-1} m^{t-1}$, where $s_i \in [m]$
for $0 \le i \le t-1$. The permutation $\pi$ can be described as

$$\begin{aligned}
\pi(c) &= c + s \pmod{n} \\
&= c + s_0 + s_1 m + \cdots + s_{t-1} m^{t-1} \pmod{n} \\
&= m^{t-1}\left(s_{t-1} + m^{-1}\left(s_{t-2} + \cdots + m^{-1}(s_0 + c)\cdots\right)\right) \pmod{n},
\end{aligned}$$

which proves that $\pi \in \Sigma^t$. Hence, we get the inclusion $\Sigma^t \supseteq \Pi_n$, which together with the
fact that there is a difference cover $\Phi$ of size $O(n^{1/2t})$ for $\Sigma$, proves the theorem for the
case when there exists an integer $m$ satisfying $n^{1/t} \le m \le 4n^{1/t}$ and $\gcd(m, n) = 1$.

Such an $m$ need not exist for every $n$ and every $t$, however. We can overcome this
difficulty by factoring $n = n_1 n_2$ such that $n_1$ consists of no even-indexed primes (3, 7, 13,
...) and $n_2$ consists of no odd-indexed primes (2, 5, 11, ...). Since we have $\gcd(n_1, n_2) = 1$,
we can use the Chinese remainder theorem to express $Z_n$ as a Cartesian product
$Z_n \cong Z_{n_1} \times Z_{n_2}$. We let $m_1$ be the first even-indexed prime at least as large as $n_1^{1/t}$, and let
$m_2$ be the first odd-indexed prime at least as large as $n_2^{1/t}$. Bertrand's postulate [19,
p. 343] guarantees that for every $x$, there is a prime between $x$ and $2x$, which means
$m_j \in [n_j^{1/t}, 4n_j^{1/t}]$ for $j = 1, 2$. (Tighter bounds are possible.)

We can now use the previous construction to construct a $t$-difference cover $\Phi_1$ of size
$O(n_1^{1/2t})$ for $Z_{n_1}$, which is isomorphic to $\Pi_{n_1}$, and a $t$-difference cover $\Phi_2$ of size $O(n_2^{1/2t})$
for $Z_{n_2}$, which is isomorphic to $\Pi_{n_2}$. Using the same technique as in the proof of Lemma 7,
we can construct a $t$-difference cover of size $O(n_1^{1/2t}) \cdot O(n_2^{1/2t}) = O(n^{1/2t})$ for
$Z_{n_1} \times Z_{n_2} \cong Z_n \cong \Pi_n$. ∎

One can rather straightforwardly use Corollary 11 to obtain a $t$-difference cover of size
$O(t n^{1/2t})$. Based on the representation of the shift amount $s = s_0 + s_1 m + \cdots + s_{t-1} m^{t-1}$,
one can come up with $t$ separate difference covers, each of size $O(n^{1/2t})$, for the $t$ separate
sequences of arithmetic shifts by $\{s m^i : s \in [m]\}$ for $0 \le i \le t-1$. Theorem 17 avoids
the extra factor of $t$ by constructing only one such difference cover and using its elements
for each one of the $t$ differences.
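
The identity behind the first case of the proof of Theorem 17 can be checked numerically. The sketch below is illustrative and not part of the paper: the function name `shift_in_t_ticks` and the parameters n = 35, m = 4, t = 3 are assumptions chosen so that m is coprime to n and m^t is at least n. It applies t - 1 maps that add a base-m digit and multiply by the inverse of m, followed by one map that adds the last digit and multiplies by m^(t-1). It relies on Python 3.8 or later for `pow(m, -1, n)`.

```python
import math

def shift_in_t_ticks(n, m, t, s, c):
    """Position of chip c's datum after t ticks that together shift by s (mod n)."""
    assert math.gcd(m, n) == 1 and m ** t >= n
    m_inv = pow(m, -1, n)                          # modular inverse of m
    digits = [(s // m ** i) % m for i in range(t)]  # s = sum of digits[i] * m**i
    for r in digits[:-1]:
        c = (m_inv * (c + r)) % n                  # one tick: add a digit, divide by m
    return (pow(m, t - 1, n) * (c + digits[-1])) % n  # last tick: add digit, multiply by m^(t-1)

n, m, t = 35, 4, 3                                 # assumed example parameters
ok = all(shift_in_t_ticks(n, m, t, s, c) == (c + s) % n
         for s in range(n) for c in range(n))
print(ok)  # True
```

Only 2m = 8 distinct single-tick permutations are used here to obtain all 35 shifts in three ticks.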


7  Extensions 

This section contains some additional results on permutation architectures and difference
covers. We describe efficient, uniform architectures that can realize the permutations
implemented by various popular interconnection networks, including multidimensional
meshes, hypercubes, and shuffle-exchange networks. We examine nonuniform permutation
architectures, and adapt some combinatorial results in the literature to apply to
permutation architectures. A result of DeBruijn leads to a nonuniform architecture with
$O(\sqrt{n} \lg n)$ pins per chip that can realize all $n!$ permutations on $n$ chips.


7.1  Specific  networks 

By  using  busses,  many  popular  interconnection  networks  can  be  realized  with  fewer  pins 
than  conventionally  proposed.  Here,  we  mention  a  few. 

The permutation architectures for realizing compass shifts on two-dimensional arrays
can be extended in a natural fashion to $d$-dimensional arrays. For the $d$-dimensional
analogue of the shifts $\{I, N, E, S, W\}$, there is a uniform architecture that uses only $d + 1$
pins per chip to implement the $2d + 1$ permutations. For the $d$-dimensional analogue of
the shifts $\{I, N, E, S, W, NE, SE, NW, SW\}$, there is a uniform architecture that uses only
$2^d$ pins per chip to implement the $3^d$ permutations. (These two results were independently
discovered by C. Fiduccia [11, 12].)

A Boolean hypercube of dimension $d$ is a degenerate case of a $d$-dimensional array.
Only $d + 1$ pins per chip are required by a permutation architecture that uses busses,
whereas $2d$ pins per chip are needed if point-to-point wires are used. (To realize a swap
of information across a dimension in one clock tick, each chip requires two pins for that
dimension: one to read and one to write.)

A permutation architecture that implements the permutations Shuffle, Inverse Shuffle,
and Exchange can be constructed with three pins per chip instead of the usual four, and
it can implement the Shuffle-Exchange and Inverse Shuffle-Exchange permutations in one
tick as well.


7.2 Average number of pins per chip


Theorem 13 presents a lower bound on the average number of pins per chip in any cyclic
shifter that operates in one clock tick. The following theorem is a natural extension of
Theorem 13 for a general set of permutations.

Theorem 18 Let $\Pi$ be a permutation set on $n$ objects with $p$ permutations and with a total
of $T$ nontrivial data transfers, and let $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ be any permutation
architecture for realizing $\Pi$. Then the average number of pins per chip is at least $T/(n\sqrt{p})$.

Proof. As in the proof of Theorem 13, we prove that $|P| \ge T/\sqrt{p}$, which implies the
theorem. We make similar notational conventions:

1. The set of busses is $B = \{b_0, \ldots, b_{m-1}\}$. We denote by $k_i$ the number of pins
connected to bus $b_i$.

2. The $r$ busses that have at least $\sqrt{p}$ pins each are indexed first, that is, $k_i \ge \sqrt{p}$ for
$i = 0, \ldots, r-1$ and $k_i < \sqrt{p}$ for $i = r, \ldots, m-1$.

We count the number of distinct data transfers that can be accomplished by each bus.
Each of the first $r$ busses can be employed to realize at most $p$ out of the $T$ nontrivial data
transfers, since it can be used at most once for each of the $p$ permutations. Any other bus
$b_i$, where $r \le i \le m-1$, can realize at most $k_i(k_i - 1)$ out of the $T$ nontrivial data transfers,
since it has only $k_i$ pins connected to it. We need to have $\sum_{i=r}^{m-1} k_i(k_i - 1) \ge T - rp$, which,
because $k_i - 1 < \sqrt{p}$ for each of these busses, implies

\[ \sum_{i=r}^{m-1} k_i \ge \frac{T - rp}{\sqrt{p}} . \]


The number of pins in the architecture can now be bounded as follows:

\[ |P| = \sum_{i=0}^{m-1} k_i = \sum_{i=0}^{r-1} k_i + \sum_{i=r}^{m-1} k_i \ge r\sqrt{p} + \frac{T}{\sqrt{p}} - r\sqrt{p} = \frac{T}{\sqrt{p}} . \]



Theorem  18  demonstrates  that  uniform  architectures  can  achieve  the  optimal  number 
(to  within  a  constant  factor)  of  pins  per  chip  for  certain  classes  of  permutation  sets. 
When  there  are  relatively  few  permutations  that  are  responsible  for  many  nontrivial  data 
transfers,  the  average  number  of  pins  per  chip  is  high.  The  set  of  cyclic  shifts  is  an  example 
of  this  kind  of  permutation  set. 
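The bound of Theorem 18 is easy to evaluate numerically. The sketch below is illustrative only: the function names are hypothetical, and the count T = n(n - 1) for the cyclic shifts assumes that each of the n - 1 nonidentity shifts moves all n data items between distinct chips.

```python
import math


def avg_pins_lower_bound(T: int, n: int, p: int) -> float:
    """Evaluate the Theorem 18 lower bound T / (n * sqrt(p)) on the average pins per chip."""
    if n <= 0 or p <= 0:
        raise ValueError("n and p must be positive")
    return T / (n * math.sqrt(p))


def cyclic_shift_transfers(n: int) -> int:
    """Nontrivial transfers for the n cyclic shifts on n chips.

    Assumption: the identity shift contributes no transfers and every
    other shift contributes n of them.
    """
    return n * (n - 1)


if __name__ == "__main__":
    for n in (16, 64, 256):
        bound = avg_pins_lower_bound(cyclic_shift_transfers(n), n, p=n)
        print(f"n = {n:4d}: average pins per chip >= {bound:.4f}")
```

For the cyclic shifts, p = n, so under the stated assumption the bound evaluates to (n - 1)/sqrt(n), which grows like sqrt(n). This matches the order of the pin counts achieved by the uniform cyclic shifters.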
7.3  Nonuniform  architectures 


When the uniformity condition on permutation architectures is dropped, one can do much
better in terms of the number of pins per chip. The complexity of control may increase
substantially, however, due to the irregular communication patterns and the number of
possible permutations realizable for some of the architectures. Nevertheless, from a
mathematical point of view, nonuniform architectures are quite interesting.

In fact, nonuniform architectures have been studied quite extensively in the
mathematics literature in the guise of partitioning problems. For the problem of realizing all n!
permutations on n chips, a result due to de Bruijn, Erdős, and Spencer [31, pp. 106-108]
implies that $O(\sqrt{n \lg n})$ pins per chip suffice. The nonuniform architecture that achieves
this  bound  is  constructed  probabilistically,  however.  It  is  an  open  problem  to  obtain  this 
bound  deterministically.  The  best  deterministic  construction  to  date  is  due  to  Feldman, 
Friedman,  and  Pippenger  [9]  and  uses  pins  per  chip. 


8  Further  research 

In  this  section  we  list  a  few  of  the  problems  that  have  been  left  open  by  our  research.  We 
also  describe  briefly  some  further  work  brought  on  by  an  earlier  version  [20]  of  our  work. 

In Section 4 we described a difference cover of size $2\lceil\sqrt{n}\rceil - 1$ for the cyclic group $Z_n$,
and proved that when n is the order of a projective plane, there is a difference cover of
size $\lceil\sqrt{n}\rceil$. It seems reasonable that any cyclic group $Z_n$ might actually have a difference
cover of size $\sqrt{n} + o(\sqrt{n})$, but we have been unable to prove or disprove this conjecture.
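To make the notion concrete, the following sketch checks the difference-cover property by brute force and builds one standard cover of size at most $2\lceil\sqrt{n}\rceil - 1$. The construction shown (m small steps together with m - 1 multiples of m, where m is the ceiling of the square root of n) is an assumption chosen for illustration and need not coincide with the construction of Section 4; the function names are hypothetical.

```python
import math


def is_difference_cover(D, n: int) -> bool:
    """True if every residue modulo n can be written as a - b with a, b in D."""
    residues = {(a - b) % n for a in D for b in D}
    return len(residues) == n


def ceil_sqrt(n: int) -> int:
    """Exact ceiling of sqrt(n) for n >= 1, using integer arithmetic only."""
    return math.isqrt(n - 1) + 1


def simple_cover(n: int) -> list:
    """Cover {0, 1, ..., m-1} together with {m, 2m, ..., (m-1)m}, reduced modulo n."""
    m = ceil_sqrt(n)
    small = {i % n for i in range(m)}
    big = {(j * m) % n for j in range(1, m)}
    return sorted(small | big)


if __name__ == "__main__":
    n = 30
    D = simple_cover(n)
    print(n, D, len(D), is_difference_cover(D, n))
```

The construction works because the differences jm - i, with 0 <= i, j <= m - 1, take m^2 >= n consecutive integer values and therefore hit every residue modulo n.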

Mills and Wiedemann [27] have computed a table of minimal difference covers for all the
cyclic groups of cardinality up to 110. For any value of n up to 110, the difference cover
they find has at most $\lceil\sqrt{n}\rceil + 2$ elements. They also provide [28] a “folk theorem” that
establishes a stronger upper bound for the general case than $2\lceil\sqrt{n}\rceil - 1$:

Theorem 19 The set of n cyclic shifts on n elements has a difference cover of size
$(\sqrt{2} + o(1))\sqrt{n}$.

Sketch of proof. [28] Let $q$ be the smallest prime such that $l = q^2 + q + 1 \ge n/2$. We
have $q = (1 + o(1))\sqrt{n/2}$, since for large $x$, there exists a prime between $x$ and $x + o(x)$.
Let $\{d_0, d_1, \ldots, d_q\}$ be a difference cover for $Z_l$ chosen as in Theorem 12. It can be
verified that the set $\{d_0, d_1, \ldots, d_q\} \cup \{d_0 + l, d_1 + l, \ldots, d_q + l\}$ forms a difference cover for
$Z_n$. ■
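The doubling step in the sketch of proof can be checked by brute force on a small instance. The example below takes q = 3, so l = 13, and uses the set {0, 1, 3, 9}, which is a perfect difference set modulo 13; using this particular set as the cover supplied by Theorem 12 is an assumption made for illustration, and the function names are hypothetical.

```python
def is_difference_cover(D, n: int) -> bool:
    """True if every residue modulo n can be written as a - b with a, b in D."""
    residues = {(a - b) % n for a in D for b in D}
    return len(residues) == n


def doubled_cover(D, l: int, n: int) -> list:
    """Return D together with D + l, reduced modulo n, as in the proof sketch.

    The elements of D are expected to be integers in the range 0 .. l - 1.
    """
    return sorted({d % n for d in D} | {(d + l) % n for d in D})


# Illustrative input: a perfect difference set modulo l = 13 (q = 3).
COVER_MOD_13 = [0, 1, 3, 9]

if __name__ == "__main__":
    l = 13
    # q = 3 is the smallest prime with l >= n/2 exactly when 15 <= n <= 26.
    for n in range(15, 2 * l + 1):
        D = doubled_cover(COVER_MOD_13, l, n)
        print(n, D, is_difference_cover(D, n))
```

The check succeeds because the integer differences of the doubled set cover every integer from -l to l - 1, which is 2l >= n consecutive values. The doubled set has 2(q + 1) elements, in line with the $(\sqrt{2} + o(1))\sqrt{n}$ bound.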

Another  interesting  problem  related  to  cyclic  shifters  involves  finding  an  area-efficient 
VLSI layout of the cyclic shifter based on projective planes. In Section 4 we presented an
area-efficient  layout  using  a  difference  cover  whose  size  is  twice  the  optimal  size.  Is  there 
a  good  layout  for  the  pin-optimal  design? 

In Section 5, we showed that any abelian group of p elements has a difference cover
of size $O(\sqrt{p})$, and we showed that any group of p elements has a difference cover of size
$O(\sqrt{p \lg p})$. Finkelstein, Kleitman, and Leighton [13] have recently improved our result for
general groups to $O(\sqrt{p})$. Their proof uses a folk theorem [8] that every simple group
of nonprime order p has a subgroup of size at least $\sqrt{p}$. The folk theorem is proved by
checking each type of group in the classification theorem [17, pp. 135-136]. It would be
interesting to know if there is a more direct proof that every group has a difference cover
of size $O(\sqrt{p})$.

To  implement  cyclic  shifters  that  operate  in  t  clock  ticks,  we  showed  how  to  construct 
a  t-difference  cover  for  of  size  A  simpler  construction  achieves  the  bound 

Theorem 13 gives a lower bound of $\lceil\sqrt{n}\rceil$ on the average number of pins per

chip  for  a  cyclic  shifter  that  operates  in  one  clock  tick.  It  may  be  possible  to  prove  a  lower 
bound  of  on  the  average  number  of  pins  per  chip  when  an  architecture  operates  in 

t  clock  ticks,  but  we  were  unable  to  extend  the  argument.  We  were  also  unable  to  extend 
either of these constructions to give good t-difference covers for groups, either general or
abelian.  It  would  be  interesting  to  know  whether  any  abelian  group  of  permutations  with 
p  permutations  has  a  t-difference  cover  of  size  for  any  t  >  1. 

We have concentrated primarily on permutation sets that have good structure,
specifically group properties. It would be interesting to identify other structural properties of
permutation sets besides group properties that allow small difference covers to exist.


Appendix 

For completeness, we include definitions of common mathematical notations and algebraic
terms used in the paper. Definitions specific to the content of the paper are included in
context.

We  adopt  the  following  notations: 

• $|X|$ denotes the size of the set $X$.

• $[n]$ denotes the set of n integers $\{1, 2, \ldots, n\}$.

• $\lfloor x \rfloor$ (floor of x) denotes the largest integer that is smaller than or equal to x.

• $\lceil x \rceil$ (ceiling of x) denotes the smallest integer that is larger than or equal to x.

• $\lg x$ denotes $\log_2 x$.

• $\ln x$ denotes $\log_e x$.

•  (*) 

For two asymptotically positive functions $f(n)$ and $g(n)$, we write:

• $f(n) = o(g(n))$ if $\lim_{n \to \infty} f(n)/g(n) = 0$.

• $f(n) = O(g(n))$ if there exist $c > 0$ and $n_0$ such that $f(n) \le c\,g(n)$ for all $n \ge n_0$.

• $f(n) = \Omega(g(n))$ if there exist $c > 0$ and $n_0$ such that $f(n) \ge c\,g(n)$ for all $n \ge n_0$.

• $f(n) = \Theta(g(n))$ if both $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$.

Let $f : A \to B$ be a function.

• $f$ is injective (one to one) if $a \ne b$ implies $f(a) \ne f(b)$.

• $f$ is surjective (onto) if for all $b \in B$, there exists some $a \in A$ such that $b = f(a)$.

• $f$ is bijective if it is injective and surjective.

A group is a set of elements $G$ with a binary operation $\odot$, such that the following
properties hold.

• Closure: For every $a, b \in G$, we have $a \odot b \in G$.

• Associativity: For every $a, b, c \in G$, we have $a \odot (b \odot c) = (a \odot b) \odot c$.

• Identity: There exists an element $e \in G$ such that $a \odot e = e \odot a = a$ for all $a \in G$.

• Inverse: For every $a \in G$, there exists an element $a^{-1} \in G$ such that $a \odot a^{-1} = a^{-1} \odot a = e$.

An abelian group is a group $G$ with an additional property:

• Commutativity: For every $a, b \in G$, we have $a \odot b = b \odot a$.

We often use the notations:

• $ab$ to denote $a \odot b$,

• $a^k$ to denote $a \odot a \odot \cdots \odot a$ (k times),

• $a^{-k}$ to denote $(a^{-1})^k$.

A cyclic group $G$ is a group in which there exists $a \in G$ such that $G = \{a^k : k \text{ integer}\}$.
Cyclic groups are abelian. The notation $Z_n$ denotes the cyclic group of residues modulo
n, with modular addition as the group operation. A permutation on a set X is a bijective
function from X to X. All the possible permutations on X form a group with functional
composition as the group operation.
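For small finite examples, the definitions above can be checked mechanically. The following sketch is a brute-force illustration with hypothetical function names; it verifies the axioms for $Z_6$ under modular addition and for the group of all permutations of three elements under functional composition.

```python
from itertools import permutations, product


def is_group(elems, op) -> bool:
    """Brute-force check of closure, associativity, identity, and inverses."""
    elems = list(elems)
    members = set(elems)
    if any(op(a, b) not in members for a, b in product(elems, repeat=2)):
        return False  # closure fails
    if any(op(a, op(b, c)) != op(op(a, b), c) for a, b, c in product(elems, repeat=3)):
        return False  # associativity fails
    identities = [e for e in elems if all(op(a, e) == a == op(e, a) for a in elems)]
    if not identities:
        return False  # no identity element
    e = identities[0]
    return all(any(op(a, b) == e == op(b, a) for b in elems) for a in elems)


def is_abelian(elems, op) -> bool:
    """Commutativity check; intended for inputs that already form a group."""
    elems = list(elems)
    return all(op(a, b) == op(b, a) for a, b in product(elems, repeat=2))


def is_cyclic(elems, op) -> bool:
    """True if the powers of some single element exhaust a finite group."""
    elems = list(elems)
    for a in elems:
        powers, x = {a}, a
        while True:
            x = op(x, a)
            if x in powers:
                break
            powers.add(x)
        if len(powers) == len(elems):
            return True
    return False


def add_mod_6(a: int, b: int) -> int:
    return (a + b) % 6


def compose(f, g):
    """Functional composition of permutations stored as tuples: (f o g)(x) = f[g[x]]."""
    return tuple(f[g[x]] for x in range(len(g)))


Z6 = list(range(6))
S3 = list(permutations(range(3)))
```

Here `Z6` is a cyclic (hence abelian) group, while `S3` is a group that is neither abelian nor cyclic.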




Acknowledgements 


Guy L. Steele, Jr. of Thinking Machines Corporation originally acquainted us with the
problem of implementing cyclic shifters with busses. Tom Leighton of MIT helped
simplify our proof of Theorem 15 and acquainted us with references to relevant work in the
combinatorics literature. Nicholas Pippenger of the University of British Columbia referred
us to the combinatorics results in Section 7.3. Noga Alon of Tel Aviv University made the
observation in Section 5 that the result of Babai and Erdős could be used to show the
existence of a small difference cover for any group. Chuck Fiduccia of General Electric
Research Center provided excellent comments and identified a few bugs in an early version
of our paper. Dr. I. J. Matrix of the Massachusetts Institute of Theology acquainted us
with his related work [14, pp. 65-67]. We thank these individuals for helpful discussions,
as well as Benny Chor, Lance Fortnow, Shafi Goldwasser, Phil Klein, and Su-Ming Wu of
MIT, and Andrew Odlyzko of AT&T Bell Laboratories. We would also like to thank the
referees, who provided excellent suggestions.


References 

[1] A. Aggarwal, “Optimal bounds for finding maximum on array of processors with k
global buses,” IEEE Transactions on Computers, Vol. C-35, No. 1, January 1986,
pp. 62-64.

[2] L. Babai and P. Erdős, “Representation of group elements as short products,” Annals
of Discrete Mathematics, Vol. 12, 1982, pp. 27-30.

[3] J. C. Bermond, J. Bond, and C. Peyrat, “Interconnection network with each node on
two buses,” Proceedings of the International Colloquium on Parallel Algorithms and
Architectures, Marseille Luminy, France, 1986, pp. 155-167.

[4] J. C. Bermond, J. Bond, and J. F. Saclé, “Large hypergraphs of diameter one,” in
Graph Theory and Combinatorics, Proc. Coll. Cambridge, 1983, Academic Press,
London, 1984, pp. 19-28.

[5] Y. Birk, Concurrent Communication among Multi-Transceiver Stations over Shared
Media, Ph.D. dissertation, Stanford University, March 1987.

[6] G. S. Bloom and S. W. Golomb, “Numbered complete graphs, unusual rulers, and
assorted applications,” in Theory and Applications of Graphs, Y. Alavi and D. R.
Lick, eds., Springer-Verlag, New York, 1978.

[7] S. H. Bokhari, “Finding maximum on an array processor with a global bus,” IEEE
Transactions on Computers, Vol. C-33, No. 2, February 1984, pp. 133-139.

[8] W. Feit, private communications, 1987.




[9] P. Feldman, J. Friedman, and N. Pippenger, “Wide-sense nonblocking networks,”
SIAM Journal of Discrete Mathematics, Vol. 1, No. 2, May 1988, pp. 158-173.

[10] C. M. Fiduccia, public communication, MIT, 1984.

[11] C. M. Fiduccia, private communication, April 1987.

[12] C. M. Fiduccia, “A bussed hypercube and other optimal permutation networks,”
presented at the 4th SIAM Conference on Discrete Mathematics, June 1988.

[13] L. Finkelstein, D. Kleitman, and T. Leighton, “Applying the classification theorem
for finite simple groups to minimize pin count in uniform permutation architectures,”
in VLSI Algorithms and Architectures, Lecture Notes in Computer Science, Vol. 319,
J. H. Reif, ed., Springer-Verlag, New York, 1988, pp. 247-256.

[14] M. Gardner, The Incredible Dr. Matrix, Charles Scribner’s Sons, New York, 1976.

[15] L. A. Glasser and D. W. Dobberpuhl, The Design and Analysis of VLSI Circuits,
Addison-Wesley, Reading, Massachusetts, 1985.

[16] S. W. Golomb, “How to number a graph,” in Graph Theory and Computing, R. C.
Read, ed., Academic Press, New York, 1972, pp. 23-37.

[17] D. Gorenstein, Finite Simple Groups, Plenum Press, New York, 1982.

[18] M. Hall, Jr., Combinatorial Theory, Blaisdell Publishing Company, Waltham,
Massachusetts, 1967.

[19] G. H. Hardy and E. M. Wright, An Introduction to the Theory of Numbers, Oxford
University Press, London, 1938.

[20] J. Kilian, S. Kipnis, and C. E. Leiserson, “The organization of permutation
architectures with bussed interconnections,” 28th Annual Symposium on Foundations of
Computer Science, IEEE, October 12-14, 1987, pp. 305-315.

[21] T. Lang, M. Valero, and M. A. Fiol, “Reduction of connections for multibus
organization,” IEEE Transactions on Computers, Vol. C-32, No. 8, August 1983, pp. 707-715.

[22] J. Leech, “On the representation of 1, 2, ..., n by differences,” Journal of the London
Mathematical Society, Vol. 31, 1956, pp. 160-169.

[23] D. J. Lewis, Introduction To Algebra, Harper and Row, New York, 1965.

[24] R. J. Lipton and R. Sedgewick, “Lower bounds for VLSI,” 13th Annual Symposium
on Theory of Computing, ACM, May 11-13, 1981, pp. 300-307.

[25] M. D. Mickunas, “Using projective geometry to design bus connection networks,”
Proceedings of the Workshop on Interconnection Networks for Parallel and Distributed
Processing, ACM/IEEE, April 21-22, 1980, pp. 47-55.




[26] J. C. P. Miller, “Difference bases, three problems in additive number theory,” in
Computers in Number Theory, A. O. L. Atkin and B. J. Birch, eds., Academic Press,
London, 1971, pp. 299-322.

[27] W. H. Mills and D. H. Wiedemann, “A table of difference coverings,” unpublished
abstract, Institute for Defense Analyses, Communications Research Division, January
1988.

[28] D. H. Wiedemann, private communication, November 1988.

[29] Q. F. Stout, “Meshes with multiple busses,” 27th Annual Symposium on Foundations
of Computer Science, IEEE, October 27-29, 1986, pp. 264-273.

[30] J. D. Ullman, Computational Aspects of VLSI, Computer Science Press, Rockville,
Maryland, 1984.

[31] J. H. van Lint, “Solutions: Problem 350,” Nieuw Archief voor Wiskunde, Vol. 22,
1974, pp. 94-109.




